Scraping taking too much time

Suppose I have 500 links and the delay is 2 seconds, and pages render/load instantly because my connection is pretty good. The whole scrape should take about a third of an hour, but it is taking almost a day, even though it looks like it is going very fast whenever I check on it. What could be the reason?

I would say that either the sitemap is set up incorrectly or there are more than 500 links, as the numbers you have provided do not add up.

If you are certain that this is how long the scraper is taking, you can just sit through the scraping job, which should take under 20 minutes, and troubleshoot the issue.

{"_id":"amazon_authors","startUrl":[""],"selectors":[{"id":"111","type":"SelectorLink","parentSelectors":["_root"],"selector":".body-display a","multiple":true,"delay":0},{"id":"number of books pages","type":"SelectorText","parentSelectors":["111"],"selector":".a-pagination li:nth-of-type(5) a","multiple":false,"regex":"","delay":0},{"id":"rank","type":"SelectorGroup","parentSelectors":["111"],"selector":"span.rank, .nodeRank > a, b a","delay":0,"extractAttribute":""},{"id":"about ","type":"SelectorText","parentSelectors":["111"],"selector":"span#author_biography","multiple":false,"regex":"","delay":0}]}

500 was an arbitrary number I gave. The actual number of links is 1540. That still does not add up: the scrape is taking much longer than it should.
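For reference, the arithmetic behind both figures can be sketched as below. This is only a rough estimate, assuming each link costs one request interval plus one page load delay and nothing else; the function and parameter names are made up for illustration and are not part of Web Scraper.

```python
def estimate_scrape_seconds(num_pages: int,
                            request_interval_s: float,
                            page_load_delay_s: float = 0.0) -> float:
    """Rough estimate: every page waits one request interval plus one page load delay.

    Assumption: pages are visited one at a time and rendering itself is instant.
    """
    return num_pages * (request_interval_s + page_load_delay_s)


# 500 was the guess, 1540 is the actual link count from the thread.
for pages in (500, 1540):
    seconds = estimate_scrape_seconds(pages, request_interval_s=2.0)
    print(f"{pages} pages: about {seconds / 60:.0f} minutes")
# 500 pages: about 17 minutes
# 1540 pages: about 51 minutes
```

Even with 1540 links the estimate stays under an hour, so a run of almost a day points to something other than the configured delay.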

I'm having a similar problem. About 2 weeks ago, the process was moving pretty quickly, but then something must have changed, because now it takes so long that it's timing out and nothing is getting scraped. I sent a bug report, but no response yet.

If the issue is occurring with Amazon, then almost certainly the explanation is that your IP address is being blocked by Amazon; to scrape it, you need a proxy.

I find that for scraping e-commerce sites, slow and steady works best. Yeah, I know that will add to the scraping time, but that is better than getting banned.
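If you were scripting this yourself instead of using the extension, "slow and steady" might look roughly like the sketch below. It is only an illustration: `fetch_slowly`, `base_delay_s`, and `jitter_s` are hypothetical names, the delay values are placeholders rather than recommendations, and `fetch` stands in for whatever function actually downloads a page.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_slowly(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 base_delay_s: float = 2.0,   # placeholder value, tune per site
                 jitter_s: float = 1.0,       # placeholder value, tune per site
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch URLs one at a time, pausing a slightly random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait base_delay_s plus up to jitter_s extra so requests are not evenly spaced.
            sleep(base_delay_s + random.uniform(0, jitter_s))
        results.append(fetch(url))
    return results
```

The `sleep` argument is only there so the pause can be swapped out when trying the function without actually waiting.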

I suggest you increase both Request interval and Page load delay. You may think it'll slow you down, but it could actually produce more consistent results and prevent bot detection. Try:

Request interval (ms)
Page load delay (ms)