Randomized Delay Ranges

jeremyrem · September 11, 2018, 2:46pm

Would like to be able to set a range for my delay, and have webscraper randomly select it for each page.

I'm having trouble with a certain website I scrape a lot; so far they have banned 10 of my proxies.

Right now I have my delay set at 7000 / 5000, but would love to have it change randomly.
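
For illustration only, the behaviour being requested can be sketched in a few lines of Python. This is not a Web Scraper feature or API; `random_delay_ms` and `wait_before_next_page` are hypothetical names, and the 5000 and 7000 figures are borrowed from the post purely as example bounds.

```python
import random
import time
from typing import Callable, Optional


def random_delay_ms(min_ms: int, max_ms: int,
                    rng: Optional[random.Random] = None) -> int:
    """Pick a delay in milliseconds uniformly from [min_ms, max_ms]."""
    if min_ms > max_ms:
        raise ValueError("min_ms must not exceed max_ms")
    source = rng if rng is not None else random
    return source.randint(min_ms, max_ms)


def wait_before_next_page(min_ms: int = 5000, max_ms: int = 7000,
                          sleep: Callable[[float], None] = time.sleep,
                          rng: Optional[random.Random] = None) -> int:
    """Sleep for a freshly drawn delay and return it (in milliseconds)."""
    delay = random_delay_ms(min_ms, max_ms, rng)
    sleep(delay / 1000.0)  # time.sleep expects seconds
    return delay
```

Calling `wait_before_next_page()` once per page would give each request a different pause inside the range, instead of the same fixed interval every time.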

bretfeig · September 11, 2018, 5:11pm

LI or AngelList? Just a guess.

jasond · September 11, 2018, 5:19pm

To simulate randomization, would you be able to add a new selector looking for some attribute text that may or may not exist, and delay it by, say, 1000ms? The text string would depend on the web page.

The CSS selector might look for "003" in the ID attribute of any div, or for an HREF that ends in "a.htm", such as:

div[id*='003']
a[href$='a.htm']

This selector should be a sibling of other valid selectors, so that the scrape doesn't get broken if this text is not found in the attributes. The DIV above, if it exists, should contain short text, so as not to clutter up the results.

If you set several of these "contingent" text selectors with different delays, would that simulate some randomization?

I haven't tested this. Just an idea that has been bugging me too.

Edit: I am not sure whether a null result will still spend the same delay time. Perhaps adding a child under that "contingent" element (perhaps a link or "element" selector) will actually create that delay.
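
A rough way to picture the idea above is the Python sketch below. It assumes a contingent selector only costs its delay when it matches, which is exactly the untested point raised in the edit. The `Page` shape, `ContingentRule`, and the 2500 ms value are hypothetical; only the two selectors and the 1000 ms figure come from the post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a scraped page: just its element IDs and link HREFs.
Page = Dict[str, List[str]]


@dataclass
class ContingentRule:
    selector: str                   # the CSS selector the rule imitates
    matches: Callable[[Page], bool]
    delay_ms: int


RULES = [
    # div[id*='003'] : any ID containing "003"
    ContingentRule("div[id*='003']",
                   lambda p: any("003" in i for i in p["ids"]), 1000),
    # a[href$='a.htm'] : any HREF ending in "a.htm"
    ContingentRule("a[href$='a.htm']",
                   lambda p: any(h.endswith("a.htm") for h in p["hrefs"]), 2500),
]


def extra_delay_ms(page: Page, rules: List[ContingentRule] = RULES) -> int:
    """Sum the delays of every contingent rule that matches this page."""
    return sum(r.delay_ms for r in rules if r.matches(page))
```

Under that assumption the extra delay is 0, 1000, 2500, or 3500 ms depending on the page content, so it varies from page to page but is fully determined by what each page contains.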

jeremyrem · September 11, 2018, 5:19pm

Neither, it's a website that lists doctors & various medical entities.

jeremyrem · September 11, 2018, 5:33pm

I'm not sure that would work well for my use case.

All the pages follow the same template. The only way I can see this working is if I use regex to extend the delay on certain words, but they would certainly see the pattern.

bretfeig · September 11, 2018, 8:02pm

Shoot me the URL. I have a few scrapers I'm playing with. Let me see how long I can go before they block my IP.

I did a video tutorial on scraping doctors/nurses from major hospitals. It was a throwback to an article written about robots.txt to source...

I saved all the outputs and made them available

Based on what you're posting, it's very basic - meant for beginners

(https://www.youtube.com/watch?v=yhG9Pk1ShvY)

Either way, enjoy.

iconoclast · September 12, 2018, 12:44am

There were earlier versions of WebScraper, in collaboration with Jens Willmer, that had randomization; for some reason it was removed.

You can add a random delay using the Tampermonkey extension, though.

I've even found someone asking for a delay over on Stack Overflow here.

leemeng · May 15, 2020, 7:16am

I've been thinking about this and you could probably simulate pseudo-randomness by using Element Click and its delay feature.

Say you're scraping New York City company info pages. All the addresses will have zip codes like 10xxx. You could set an Element Click with a 10 sec delay that only fires when the zip code is 10001. The selector would be something like:

div.zipcode:contains('10001')

wsdc · April 12, 2021, 12:58am

And you can change the delay while the scrape is running.