Scraper skips data

I am scraping mywed.com for the website URLs of photographers. I set it to scrape pages 11 to 640, but it returned only around 16k results when it should be 18k. Why does this happen? How can I find out what it skipped, and why?

Hello
increase the delay


hello! :slight_smile:
which one?

In the parameters of the selectors that you used.
As you didn't post your sitemap, it's difficult to help you further.

{"_id":"mywed","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-7]/"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"url","type":"SelectorLink","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"delay":0}]}

Posted the sitemap above; it's for one country. Even if I set both delays to 6000 it still misses some of the URLs.

Hi,

when the photographer's website no longer exists, the scraper misses one record during the scrape. Your browser shows the message: "this site is inaccessible".

So, to avoid losing records, you should change the URL selector type from Link to HTML.

{"_id":"test","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-2]"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":"0"},{"id":"tel","type":"SelectorText","selector":"a.nui-profile-phone","parentSelectors":["photographer"],"multiple":false,"regex":"","delay":0},{"id":"Website","type":"SelectorHTML","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"regex":"","delay":0}]}

But when there's no site URL present it outputs "null", and in some cases it still misses a page completely. Some pages return only 12 out of 30 records, others the full 30; I'm not sure what it depends on.

I tested page 2 and page 3 and collected 30 + 30 = 60 records without problems with the sitemap I gave you, and without losing data.
Try to test more and detect on which pages you lose data.
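One way to detect which pages lose data is to count records per start URL in the exported CSV and flag the short ones. This is just a sketch: it assumes Web Scraper's CSV export with its web-scraper-start-url column, and the expected count of 30 records per page is taken from the posts above; adjust both if your export differs.

```python
import csv
import io
from collections import Counter

def find_short_pages(rows, expected=30):
    # Count scraped records per start URL and return the pages
    # that came back with fewer than `expected` records.
    counts = Counter(row["web-scraper-start-url"] for row in rows)
    return {url: n for url, n in counts.items() if n < expected}

# Tiny in-memory sample shaped like Web Scraper's CSV export
# (real exports have more columns; only web-scraper-start-url matters here).
sample = io.StringIO(
    "web-scraper-order,web-scraper-start-url,url\n"
    "1,https://mywed.com/ru/Israel-wedding-photographers/p2/,https://a.example\n"
    "2,https://mywed.com/ru/Israel-wedding-photographers/p2/,https://b.example\n"
    "3,https://mywed.com/ru/Israel-wedding-photographers/p3/,https://c.example\n"
)
short = find_short_pages(list(csv.DictReader(sample)), expected=2)
print(short)
```

Rerunning only the flagged pages then shows whether the skips are load-related, as discussed below.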

I used my sitemap on 7 pages but changed the url selector from Link to HTML: no loss. Now trying all 640 pages with 2000 ms delays.

So, it doesn't help on large volumes. The result for all 640 pages is the same as with the text url - 2000 people are skipped. And I don't see any pattern: one page gives 30/30, but the next two (328 and 329) give 27 and 24 respectively.

I use the Link selector because I need the link itself, not the text: some people link to their Instagram profiles, and if I use Text or HTML I get "instagram.com" (as it appears on the page), not the actual URL.

Also, if I scrape just the pages with missed people, it scrapes all of them. It only skips at large volumes.
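If the goal is the href attribute rather than the link text, Web Scraper also has an Element attribute selector (SelectorElementAttribute) that extracts a chosen attribute such as href without following the link, so dead sites don't cost a navigation. A sketch of the posted sitemap with only the url selector swapped (the _id "mywed_attr" is made up; everything else is copied from the sitemap above and untested against mywed.com's current markup):

{"_id":"mywed_attr","startUrl":["https://mywed.com/ru/Israel-wedding-photographers/p[2-7]/"],"selectors":[{"id":"photographer","type":"SelectorLink","selector":"div.photographer_name_geo > a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"url","type":"SelectorElementAttribute","selector":"span a.nui-href","parentSelectors":["photographer"],"multiple":false,"extractAttribute":"href","delay":0}]}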

Hi,
I have a similar problem. Did you solve it?