As of August 2024, YouTube updated its page content loading so that if you try to scrape captions by fetching a video page from a server, no captions are available. This would be an open-and-shut case if it weren't also true that scrapers still work from my local environment.
There is a Node package called youtube-caption-scraper which does a simple fetch of a video page's HTML, pulls the captions for the language of choice (or the auto-generated captions), and returns them. The package works great when I run the code from my own PC, but it fails when run from deployed code on a server.
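For reference, here is a minimal sketch of that fetch-and-parse approach. It is not the package's actual source. It assumes the page HTML embeds a JSON array under a `"captionTracks"` key whose entries carry `baseUrl`, `languageCode`, and an optional `kind` of `"asr"` for auto-generated tracks; the function names are made up for illustration, and it assumes Node 18+ for the global `fetch`.

```typescript
// Shape of one caption track entry. Field names are an assumption based on
// what the embedded page data looked like, not a documented API.
interface CaptionTrack {
  baseUrl: string;
  languageCode: string;
  kind?: string; // "asr" appeared to mark auto-generated captions
}

// Pull the JSON array that follows "captionTracks": out of raw page HTML.
// Returns an empty array when the key is missing, which is what the
// server-side fetch appears to run into.
function extractCaptionTracks(html: string): CaptionTrack[] {
  const marker = '"captionTracks":';
  const start = html.indexOf(marker);
  if (start === -1) return [];
  const arrayStart = start + marker.length;
  if (html[arrayStart] !== "[") return [];

  // Walk forward to the matching "]", ignoring brackets inside strings.
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = arrayStart; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "[") depth++;
    else if (ch === "]") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(html.slice(arrayStart, i + 1)) as CaptionTrack[];
        } catch {
          return [];
        }
      }
    }
  }
  return [];
}

// Prefer a manually authored track in the requested language and fall back
// to the auto-generated one.
function pickTrack(tracks: CaptionTrack[], lang: string): CaptionTrack | undefined {
  return (
    tracks.find((t) => t.languageCode === lang && t.kind !== "asr") ??
    tracks.find((t) => t.languageCode === lang)
  );
}

// Fetch the watch page and list whatever caption tracks it exposes.
async function fetchCaptionTracks(videoId: string): Promise<CaptionTrack[]> {
  const res = await fetch(`https://www.youtube.com/watch?v=${videoId}`);
  return extractCaptionTracks(await res.text());
}
```

Logging the length of the array returned by `fetchCaptionTracks` from both environments is a quick way to compare them: a non-empty list locally and an empty list on the server would match the behavior described above.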
I can also do a plain fetch from a local script, without any packages, and see the caption text right there in the response data. So my question stands: is it really impossible to scrape captions from an automated app or server? I've tried:
- Running the script from a Raspberry Pi to emulate a local environment (didn't work)
- Manipulating my headers when sending the request to make YouTube think I'm a PC and not a server (didn't work)
- Using a YouTube video downloading library (youtube-dl-exec) to extract only the subtitles .vtt file (worked, but got rate limited after 5 tries)
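For the last attempt, a rough sketch of a subtitle-only download with retries spaced out by exponential backoff is below. The flag names mirror youtube-dl's `--skip-download`, `--write-sub`, `--write-auto-sub`, `--sub-lang`, `--sub-format`, and `--output` options in the camelCase form that youtube-dl-exec accepts. The `withBackoff` and `downloadSubtitles` helpers, the delay values, and the injected `run` parameter are hypothetical, and whether spacing out calls actually avoids the rate limit is untested.

```typescript
// Flags for a subtitles-only run; these mirror youtube-dl CLI options.
interface SubtitleFlags {
  skipDownload: boolean; // --skip-download: do not fetch the video itself
  writeSub: boolean;     // --write-sub: write uploaded subtitles
  writeAutoSub: boolean; // --write-auto-sub: write auto-generated subtitles
  subLang: string;       // --sub-lang
  subFormat: string;     // --sub-format
  output: string;        // --output template
}

// The downloader is passed in so the sketch stays self-contained.
type Runner = (url: string, flags: SubtitleFlags) => Promise<unknown>;

function subtitleFlags(lang: string, outDir: string): SubtitleFlags {
  return {
    skipDownload: true,
    writeSub: true,
    writeAutoSub: true,
    subLang: lang,
    subFormat: "vtt",
    output: `${outDir}/%(id)s.%(ext)s`,
  };
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retry a failing task, doubling the wait each time
// (baseDelayMs, 2x, 4x, ...). Rethrows the last error once attempts run out.
async function withBackoff<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 2000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Request only the subtitle file for one video, retrying with backoff.
async function downloadSubtitles(
  run: Runner,
  videoUrl: string,
  lang = "en",
  outDir = ".",
): Promise<unknown> {
  return withBackoff(() => run(videoUrl, subtitleFlags(lang, outDir)));
}
```

In practice `run` would be a thin wrapper around the default export of youtube-dl-exec, for example `(url, flags) => youtubedl(url, flags)`, assuming its `(url, flags)` call shape.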
Any ideas from a different perspective are appreciated; I've banged my head against this long enough.