A site has decided to block me, showing no content when I try to scrape it: https://avvo.com
I can successfully navigate the site manually in an Incognito window, or in another window with a different session (using the Sessionbox Chrome extension). After clearing my session (with the Chrome Clear Session extension), I can also navigate the site manually in the main Chrome browser, and in a different browser (like IE).
But even after clearing my session and restarting the browser, the site still blocks WebScraper from accessing it during scraping. Although I'm not sure how to verify this, the behavior suggests that WebScraper reuses the same session for scraping (even after a browser restart), and that either a cookie in that session or the session ID itself has been flagged by the site and blocked permanently.
So it appears that some way to change the session WebScraper uses is necessary to work around blocks like this. Two suggestions:
1. Have a checkbox on the Metadata screen that tells WebScraper to spawn a new session for the scrape.
I realize there will be some sites where this isn't a good idea (since, for example, the user may need to login to the site in the same session WebScraper uses before starting the scrape). But a simple checkbox on the Metadata screen which allows you to have WebScraper spawn a new session for each scrape would help prevent sites blocking sessions.
Note that the checkbox could also go on the Scrape screen, but that would likely be very annoying if you have to run the scrape repeatedly, since the Scrape screen doesn't remember your choices. So the Metadata screen would be ideal for this.
2. Have a field on the Scrape screen where you can enter Chrome command line arguments.
This would allow you, for example, to enter the --incognito flag for the scrape, which would (at least in my case) solve this problem.
However, in addition to likely being one of the simplest ways to solve the session problem, this feature would also allow the use of other command line options that might solve other scraping problems. For example, the ability to specify which Chrome profile to use could be very helpful, and flags like --mute-audio might come in handy when scraping some sites.
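For reference, the effect I'm after can already be demonstrated by launching Chrome by hand with these flags. A minimal sketch (the `google-chrome` binary name and the helper function are my own assumptions, not anything WebScraper provides; a throwaway --user-data-dir gives a fresh profile with no reused cookies):

```python
import shlex
import subprocess
import tempfile

def chrome_args(url, incognito=True, extra_flags=None):
    """Build a Chrome command line that starts with a throwaway session.

    Assumes a `google-chrome` binary on PATH; adjust for your platform
    (e.g. `chrome.exe` on Windows, `Google Chrome` app on macOS).
    """
    # A fresh, empty profile directory means no stored cookies or
    # session IDs carry over from previous scrapes.
    profile_dir = tempfile.mkdtemp(prefix="scrape-profile-")
    args = ["google-chrome", f"--user-data-dir={profile_dir}"]
    if incognito:
        args.append("--incognito")
    args.extend(extra_flags or [])
    args.append(url)
    return args

cmd = chrome_args("https://avvo.com", extra_flags=["--mute-audio"])
print(shlex.join(cmd))
# subprocess.run(cmd)  # uncomment to actually launch Chrome
```

A field on the Scrape screen would simply pass such flags through to the browser process WebScraper spawns.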