Scraper skips data

I am scraping mywed.com for the website URLs of photographers. I set it to scrape pages 11 to 640, but it returned only around 16k results when it should be 18k. Why does this happen? How can I find out what it skipped, and why?

Hello
increase the delay


hello! :slight_smile:
which one?

In the parameters of the selectors that you used.
As you didn't post your sitemap, it's difficult to help you further.

{"_id":"mywed","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-7]/"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"url","type":"SelectorLink","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"delay":0}]}

Posted the sitemap above; it's for one country. Even if I set both delays to 6000 it still misses some of the URLs.

Hi,

when the photographer's website no longer exists, the scraper misses one record during the scrape. Your browser shows the message: "this site is inaccessible".

So, to avoid losing records, you should change the URL selector type from Link to HTML.

{"_id":"test","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-2]"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":"0"},{"id":"tel","type":"SelectorText","selector":"a.nui-profile-phone","parentSelectors":["photographer"],"multiple":false,"regex":"","delay":0},{"id":"Website","type":"SelectorHTML","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"regex":"","delay":0}]}

But when there's no site URL present it outputs "null", and in some cases it still misses a page completely. Some pages return only 12 out of 30 records, others the full 30; I'm not sure what it depends on.

I tested page 2 and page 3 and collected 30 + 30 = 60 records without problems with the sitemap I gave you, and without losing data.
Try to test more and detect on which pages you lose data.
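One way to detect which pages lose data is to count records per start URL in the exported CSV and flag the short ones. This is just a sketch: it assumes Web Scraper's CSV export with its web-scraper-start-url column, and the expected count of 30 records per page is taken from the posts above; adjust both if your export differs.

```python
import csv
import io
from collections import Counter

def find_short_pages(rows, expected=30):
    # Count scraped records per start URL and return the pages
    # that came back with fewer than `expected` records.
    counts = Counter(row["web-scraper-start-url"] for row in rows)
    return {url: n for url, n in counts.items() if n < expected}

# Tiny in-memory sample shaped like Web Scraper's CSV export
# (real exports have more columns; only web-scraper-start-url matters here).
sample = io.StringIO(
    "web-scraper-order,web-scraper-start-url,url\n"
    "1,https://mywed.com/ru/Israel-wedding-photographers/p2/,https://a.example\n"
    "2,https://mywed.com/ru/Israel-wedding-photographers/p2/,https://b.example\n"
    "3,https://mywed.com/ru/Israel-wedding-photographers/p3/,https://c.example\n"
)
short = find_short_pages(list(csv.DictReader(sample)), expected=2)
print(short)
```

Rerunning only the flagged pages then shows whether the skips are load-related, as discussed below.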

I used my sitemap on 7 pages but changed the url selector from Link to HTML: no loss. Now trying all 640 pages with 2000 ms delays.

So, it doesn't help on large volumes. The result for all 640 pages is the same as with the text url - 2000 people are skipped. And I don't see any pattern: one page gives 30/30, but the next two (328 and 329) give 27 and 24 respectively.

I use the Link selector because I need the link itself, not the text: some people link to their Instagram profiles, and if I use Text or HTML I get "instagram.com" (as it appears on the page), not the actual URL.

Also, if I scrape just the pages with missed people, it scrapes all of them. It only skips at large volumes.
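If the goal is the href attribute rather than the link text, Web Scraper also has an Element attribute selector (SelectorElementAttribute) that extracts a chosen attribute such as href without following the link, so dead sites don't cost a navigation. A sketch of the posted sitemap with only the url selector swapped (the _id "mywed_attr" is made up; everything else is copied from the sitemap above and untested against mywed.com's current markup):

{"_id":"mywed_attr","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-7]/"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"url","type":"SelectorElementAttribute","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"extractAttribute":"href","delay":0}]}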

Hi,
I have a similar problem. Did you solve it?