Multiple Start URLs

AndyM · June 25, 2018, 2:03pm

I'm trying to scrape the details of pubs across the UK from whatpub.com. I've built the search URLs from UK postal districts and their centroid lat/longs, and want to enter many at once to make the process quicker and more automated.

However, Web Scraper doesn't seem to export all the pubs: it'll do all 54 for DL7, for example, then only 16 for DL6. Could it be that Web Scraper dedupes the list, i.e. if a pub appears in both lists, it would only scrape it once?

Thanks in advance for any help!

https://whatpub.com/search?q=DL6&t=pc&p=[001-08]&lat=54.3683274169935&lng=-1.39343008888889&_token=IjzU1dLHw4SmMgwFtdXWzYAw6az3XCID5WJfxpYL&features=Open

https://whatpub.com/search?q=DL7&t=pc&p=[001-08]&lat=54.3369054905462&lng=-1.47934658928571&_token=IjzU1dLHw4SmMgwFtdXWzYAw6az3XCID5WJfxpYL&features=Open

Sitemap:
{"_id":"camra","startUrl":["https://whatpub.com/search?q=DL6&t=pc&p=[001-08]&lat=54.3683274169935&lng=-1.39343008888889&_token=IjzU1dLHw4SmMgwFtdXWzYAw6az3XCID5WJfxpYL&features=Open","https://whatpub.com/search?q=DL7&t=pc&p=[001-08]&lat=54.3369054905462&lng=-1.47934658928571&_token=IjzU1dLHw4SmMgwFtdXWzYAw6az3XCID5WJfxpYL&features=Open"],"selectors":[{"id":"pub","type":"SelectorLink","selector":"h2 a","parentSelectors":["_root","pagination"],"multiple":true,"delay":0},{"id":"name","type":"SelectorText","selector":"section h1","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.pub_left section:nth-of-type(1) p.no_margin-top","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"lat long","type":"SelectorLink","selector":"section.view_on_map a","parentSelectors":["pub"],"multiple":false,"delay":0},{"id":"owner","type":"SelectorText","selector":"section:nth-of-type(8) p.no_margin-top","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"last survey","type":"SelectorText","selector":"section#pub-footer p:nth-of-type(1)","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"facilities","type":"SelectorText","selector":"div#about-tab.ui-tabs-panel div.pub_right section:nth-of-type(3)","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"features","type":"SelectorText","selector":"div#about-tab.ui-tabs-panel div.pub_right section:nth-of-type(2)","parentSelectors":["pub"],"multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"p.numbers a","parentSelectors":["_root"],"multiple":true,"delay":0}]}
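
Start URLs like the two above could be generated from a list of districts rather than assembled by hand. Here is a minimal Python sketch, assuming the same query parameters that appear in those URLs. The function name `build_start_url` is hypothetical, and the `_token` value may well be tied to a browser session, so treat it as a placeholder that might need refreshing.

```python
from urllib.parse import urlencode

BASE_URL = "https://whatpub.com/search"


def build_start_url(district, lat, lng, token, pages="[001-08]"):
    """Build one search URL for a postal district (hypothetical helper).

    lat/lng are passed as strings so no precision is lost in formatting.
    """
    params = {
        "q": district,
        "t": "pc",
        "p": pages,        # page range, kept in the [001-08] form used above
        "lat": lat,
        "lng": lng,
        "_token": token,   # assumption: copied from a live browser session
        "features": "Open",
    }
    # safe="[]" keeps the brackets of the page range unescaped
    return BASE_URL + "?" + urlencode(params, safe="[]")


if __name__ == "__main__":
    token = "IjzU1dLHw4SmMgwFtdXWzYAw6az3XCID5WJfxpYL"
    districts = [
        ("DL6", "54.3683274169935", "-1.39343008888889"),
        ("DL7", "54.3369054905462", "-1.47934658928571"),
    ]
    for name, lat, lng in districts:
        print(build_start_url(name, lat, lng, token))
```

The printed URLs can then be pasted into the `startUrl` array of the sitemap.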

iconoclast · June 25, 2018, 2:37pm

Hi!

Web Scraper will import identical URLs, but it skips the duplicates afterwards. Unfortunately, all URLs have to be different.
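
To illustrate why one district can end up with fewer records than its search results list, here is a small Python sketch of duplicate skipping. It is only a model of the behaviour described above, not Web Scraper's actual code; the function name `count_new_pubs` and the pub URLs are made up for the example.

```python
def count_new_pubs(results_by_district):
    """Count how many pub URLs are new for each district.

    A URL already seen under an earlier district is skipped,
    so it is only counted (scraped) once overall.
    """
    seen = set()
    new_counts = {}
    for district, pub_urls in results_by_district.items():
        count = 0
        for url in pub_urls:
            if url not in seen:
                seen.add(url)
                count += 1
        new_counts[district] = count
    return new_counts


if __name__ == "__main__":
    # Hypothetical data: two neighbouring districts sharing two pubs
    results = {
        "DL7": ["pub-a", "pub-b", "pub-c"],
        "DL6": ["pub-b", "pub-c", "pub-d"],
    }
    print(count_new_pubs(results))
```

In this toy example both districts list three pubs, but only one of the second district's pubs is new, so its count drops even though nothing is missing from the combined result.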

AndyM · June 25, 2018, 2:57pm

Fantastic, thanks! Skipping the duplicate pubs saves me time.

I was just concerned because the number of records differed between districts, so I thought the program might be missing some of the pubs I need to collect.
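
One way to check that concern would be to compare the exported CSV against a list of pub URLs you expect to see. The Python sketch below assumes the export has a `pub-href` column holding each pub's link (based on the `pub` link selector in the sitemap above); the column name, the function name `missing_pubs`, and the sample data are all assumptions to adjust to the real export.

```python
import csv
import io


def missing_pubs(export_csv_text, expected_urls, url_column="pub-href"):
    """Return expected pub URLs that do not appear in the export."""
    reader = csv.DictReader(io.StringIO(export_csv_text))
    exported = {row[url_column] for row in reader}
    return sorted(set(expected_urls) - exported)


if __name__ == "__main__":
    # Hypothetical export with two scraped pubs
    sample_export = (
        "pub,pub-href,name\n"
        "Pub A,https://example.com/pub-a,Pub A\n"
        "Pub B,https://example.com/pub-b,Pub B\n"
    )
    expected = [
        "https://example.com/pub-a",
        "https://example.com/pub-b",
        "https://example.com/pub-c",
    ]
    print(missing_pubs(sample_export, expected))
```

An empty list would suggest that the lower count for a district comes from duplicates being skipped rather than from pubs being missed.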