A site has decided to block me, showing no content when I try to scrape it: https://avvo.com
I can successfully navigate the site manually in an Incognito window, or in another window with a different session (using the Sessionbox Chrome extension). After clearing my session (with the Chrome Clear Session extension), I can also navigate the site manually in the main Chrome browser, and in a different browser (like IE).
But even after clearing my session and restarting the browser, the site still blocks WebScraper from accessing it during scraping. Although I'm not sure how to verify this, the behavior suggests that WebScraper reuses the same session for scraping (even after a browser restart), and that either a cookie in that session or the session ID itself has been flagged by the site and blocked permanently.
So it appears that some way to change the session WebScraper uses is necessary to work around blocks like this. Two suggestions:
1. Have a checkbox on the Metadata screen that tells WebScraper to spawn a new session for the scrape.
I realize there will be some sites where this isn't a good idea (since, for example, the user may need to login to the site in the same session WebScraper uses before starting the scrape). But a simple checkbox on the Metadata screen which allows you to have WebScraper spawn a new session for each scrape would help prevent sites blocking sessions.
Note that the checkbox could also go on the Scrape screen, but that would likely be very annoying if you have to run the scrape repeatedly, since the Scrape screen doesn't remember your choices. So the Metadata screen would be ideal for this.
2. Have a field on the Scrape screen where you can enter Chrome command line arguments.
This would allow you, for example, to enter the --incognito flag for the scrape, which would (at least in my case) solve this problem.
However, in addition to likely being one of the simplest ways to solve the session problem, this feature would also allow the use of other command line options that might solve other scraping problems. For example, the ability to specify which Chrome profile to use could be very helpful, and flags like --mute-audio might come in handy when scraping some sites.
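For reference, the effect I'm after can already be demonstrated by launching Chrome by hand with these flags. A minimal sketch (the `google-chrome` binary name and the helper function are my own assumptions, not anything WebScraper provides; a throwaway --user-data-dir gives a fresh profile with no reused cookies):

```python
import shlex
import subprocess
import tempfile

def chrome_args(url, incognito=True, extra_flags=None):
    """Build a Chrome command line that starts with a throwaway session.

    Assumes a `google-chrome` binary on PATH; adjust for your platform
    (e.g. `chrome.exe` on Windows, `Google Chrome` app on macOS).
    """
    # A fresh, empty profile directory means no stored cookies or
    # session IDs carry over from previous scrapes.
    profile_dir = tempfile.mkdtemp(prefix="scrape-profile-")
    args = ["google-chrome", f"--user-data-dir={profile_dir}"]
    if incognito:
        args.append("--incognito")
    args.extend(extra_flags or [])
    args.append(url)
    return args

cmd = chrome_args("https://avvo.com", extra_flags=["--mute-audio"])
print(shlex.join(cmd))
# subprocess.run(cmd)  # uncomment to actually launch Chrome
```

A field on the Scrape screen would simply pass such flags through to the browser process WebScraper spawns.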