Hey hey,
What I'm trying to do:
I am using the German Yellow Pages (gelbeseiten.de) to collect address data for a search keyword. This works pretty well with this sitemap:
URL: https://www.gelbeseiten.de/Suche/arbeitsbühne/Berlin?umkreis=50000
Sitemap:
{"_id":"yellowpages_elements_noexternal","startUrl":["https://www.gelbeseiten.de/Suche/arbeitsbühne/Berlin?umkreis=50000"],"selectors":[{"id":"elements","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"article","multiple":true,"delay":"1000","clickElementSelector":"a.mod-LoadMore--button","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"company","type":"SelectorText","parentSelectors":["elements"],"selector":"h2","multiple":false,"regex":"","delay":0},{"id":"category","type":"SelectorText","parentSelectors":["elements"],"selector":"p.d-inline-block","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["elements"],"selector":"p[data-wipe-name='Adresse']","multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","parentSelectors":["elements"],"selector":"p.mod-AdresseKompakt__phoneNumber","multiple":false,"regex":"","delay":0},{"id":"homepage","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.contains-icon-homepage","multiple":false,"delay":0},{"id":"email","type":"SelectorText","parentSelectors":["elements"],"selector":"a.contains-icon-email","multiple":false,"regex":"","delay":0}]}
In this search I have 51 hits and after scraping the site I have 51 entries with address data in my CSV. Check. Works.
But I want more. While scraping, I want to visit the homepage of every hit and try to get information about its social media accounts (Facebook, Twitter, LinkedIn, etc.). To do so, I just added a link selector a[href*="facebook.com"] as a child of the link selector "homepage" (as seen in the following sitemap):
URL: https://www.gelbeseiten.de/Suche/arbeitsbühne/Berlin?umkreis=50000
Sitemap:
{"_id":"yellowpages_elements_external","startUrl":["https://www.gelbeseiten.de/Suche/arbeitsbühne/Berlin?umkreis=50000"],"selectors":[{"id":"elements","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"article","multiple":true,"delay":"1000","clickElementSelector":"a.mod-LoadMore--button","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"company","type":"SelectorText","parentSelectors":["elements"],"selector":"h2","multiple":false,"regex":"","delay":0},{"id":"category","type":"SelectorText","parentSelectors":["elements"],"selector":"p.d-inline-block","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["elements"],"selector":"p[data-wipe-name='Adresse']","multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","parentSelectors":["elements"],"selector":"p.mod-AdresseKompakt__phoneNumber","multiple":false,"regex":"","delay":0},{"id":"homepage","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.contains-icon-homepage","multiple":false,"delay":0},{"id":"email","type":"SelectorText","parentSelectors":["elements"],"selector":"a.contains-icon-email","multiple":false,"regex":"","delay":0},{"id":"get_facebook","type":"SelectorLink","parentSelectors":["homepage"],"selector":"a[href*=\"facebook.com\"]","multiple":false,"delay":0}]}
Now things happen that I don't understand: I can see in the popup that the scraper visits all the external sites, but I end up with only around 30 scraping results. And if I run the exact same job a second time (a few minutes later), I even get different results.
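To pin down which hits go missing, the export of the working sitemap could be compared against the export of a failing run. Below is a minimal Python sketch of that idea. It assumes both CSV exports contain a column named "company" (the selector id used above); the function names and the tiny sample data are made up for illustration and are not real scraping results.

```python
import csv
import io


def companies(csv_text):
    """Collect the non-empty values of the `company` column from CSV text."""
    names = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows with a missing or empty company field are skipped.
        name = (row.get("company") or "").strip()
        if name:
            names.add(name)
    return names


def compare_runs(baseline_csv, run_csv):
    """Return companies present in the baseline export but absent from a later run."""
    return sorted(companies(baseline_csv) - companies(run_csv))


# Made-up exports, only to show the comparison (hypothetical data).
baseline = "company,homepage\nFirma A,http://a.example\nFirma B,http://b.example\n"
second_run = "company,homepage\nFirma A,http://a.example\n"
print(compare_runs(baseline, second_run))
```

With real exports, the file contents would be read first (for example with open(path, encoding="utf-8-sig").read(), where path is whatever the export was saved as) and passed in the same way. Running this against two failing runs might show whether the same companies drop out each time or a different set.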
Can anyone please help?