Cloudflare botfilter challenges

Trying to scrape a commerce site behind a Cloudflare proxy, and from what I can tell, Web Scraper does this: load the browse page, collect the links to the detail pages, then loop through the detail pages one by one.

But won't that expose the requests as bot-like, since the referrer is no longer the browse page? Also, if every request has exactly the same timing in ms, that seems like a dead giveaway of bot browsing. I wonder if there might be a way to set a range so the timing is randomized a little.

Sites mainly check for excessive traffic, e.g. more than 5 page requests within 10 seconds, so you can often avoid triggering them just by using longer delays. Of course, the scraping then takes longer, unless you spread it across multiple machines/VMs.
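
Just to illustrate the idea outside the extension, here's a rough Python sketch of the browse-page → detail-pages loop with a conservative fixed delay between requests. The URL, selector, and delay value are made up for the example:

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-shop.com"   # hypothetical site
DELAY_SECONDS = 5                        # stays well under "5 requests per 10 seconds"

session = requests.Session()

# Load the browse page and collect the detail-page links.
browse = session.get(f"{BASE_URL}/browse")
soup = BeautifulSoup(browse.text, "html.parser")
detail_links = [a["href"] for a in soup.select("a.product-link")]  # hypothetical selector

# Visit the detail pages one by one, pausing between requests so the
# crawl stays below typical rate-limit thresholds.
for link in detail_links:
    page = session.get(f"{BASE_URL}{link}")
    # ... parse the detail page here ...
    time.sleep(DELAY_SECONDS)
```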

The "request is always exactly the same timing" isn't an issue AFAIK due to Internet latency; if you do a ping test of major sites, you will rarely get the same reply time in ms. You could make pattern-finding a little harder by using non-round numbers for both Request interval and Page load delay, e.g:

Request interval: 2167
Page load delay: 4381

But yeah, a feature like wget's --random-wait would be nice.
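
In the meantime, a randomized delay is easy to approximate in your own code. Here's a small sketch of the same idea wget uses: --random-wait varies the pause between 0.5x and 1.5x of the base wait. The base value and the URLs are just placeholders:

```python
import random
import time

BASE_DELAY = 4.381   # seconds, deliberately non-round

def random_wait(base: float = BASE_DELAY) -> None:
    """Sleep for a randomized interval, roughly like wget's --random-wait,
    i.e. somewhere between 0.5x and 1.5x of the base delay."""
    time.sleep(random.uniform(0.5 * base, 1.5 * base))

# Example: jittered pause between detail-page requests.
for url in ["/product/1", "/product/2", "/product/3"]:  # hypothetical URLs
    # fetch(url) ...
    random_wait()
```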
