Multiple Start URLs

How does this work? I have defined two start URLs in the sitemap metadata. After saving, I am taken to _root with the option to 'Add new selector'.

How are the multiple start URLs iterated? In my situation, the start URLs are links to result pages returned for different search criteria. I need the same scrape from each.

At present, when I try to save the results, I only get the data from one of the start URLs.

@DaveJ19 Hello, multiple start URLs for a sitemap can be added via the UI of Web Scraper Cloud (handles up to 20,000 start URLs).

I don't use Cloud. The Chrome plugin lets me create multiple start URLs, but I don't understand how they are used, as my output only ever comes from one of them.

@DaveJ19 Could you please provide the sitemap's JSON?

Here's the JSON

{"_id":"test","startUrl":["The Kennel Club card","multiple":true,"parentSelectors":["_root"],"selector":"a.m-judge-card__link","type":"SelectorLink"},{"id":"name","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"phone","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Phone') + dd a","type":"SelectorText"},{"id":"address","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Address') + dd","type":"SelectorText"},{"id":"judging section","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"div.m-tabs__panel","type":"SelectorText"},{"id":"contact link","multiple":false,"parentSelectors":["judge card"],"selector":"dt:contains('Email') + dd a","type":"SelectorLink"},{"id":"judges id","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Field Trial judge ID') + dd","type":"SelectorText"}]}

I only seem to get the information from the last start URL.

@DaveJ19 Hi, please apply the preformatted text option after pasting the sitemap; otherwise the JSON is invalid due to autoformatting.

{"_id":"test","startUrl":["https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Retriever&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=","https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Spaniel&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups="],"selectors":[{"id":"judge card","parentSelectors":["_root"],"type":"SelectorLink","selector":"a.m-judge-card__link","multiple":true},{"id":"name","parentSelectors":["judge card"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"phone","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Phone') + dd a","multiple":false,"regex":""},{"id":"address","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Address') + dd","multiple":false,"regex":""},{"id":"judging section","parentSelectors":["judge card"],"type":"SelectorText","selector":"div.m-tabs__panel","multiple":false,"regex":""},{"id":"contact link","parentSelectors":["judge card"],"type":"SelectorLink","selector":"dt:contains('Email') + dd a","multiple":false},{"id":"judges id","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Field Trial judge ID') + dd","multiple":false,"regex":""}]}

@DaveJ19 Hi, it appears the start URLs are identical. Please note that Web Scraper does not visit the same link twice, and any duplicate links are automatically discarded.

OK, it seems very odd.

URL 1 = https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Retriever&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=
URL 2 = https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Spaniel&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=

When I paste each one into a web browser, I get different results back, based on the search criteria of the web page.

I don't understand how these URLs are being handled. Which part of the URL gets checked? The two URLs differ after the following point:

 https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=

The first URL then has "Retriever&SelectedPanelBFieldTrials......etc"

The second URL then has "Spaniel&SelectedPanelBFieldTrials....... etc"
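The comparison described above can be checked mechanically. The following is only a minimal sketch using Python's standard `urllib.parse`; the helper name `differing_params` is made up for illustration, and the two URLs are shortened to a handful of the query parameters from the real ones to keep the example readable. It says nothing about how Web Scraper itself compares links.

```python
from urllib.parse import urlsplit, parse_qs


def differing_params(url_a, url_b):
    """Return {param: (values_a, values_b)} for query parameters that differ."""
    # keep_blank_values=True keeps empty parameters such as "Breed=".
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (qa.get(key), qb.get(key))
        for key in sorted(set(qa) | set(qb))
        if qa.get(key) != qb.get(key)
    }


# Shortened versions of the two start URLs (only a subset of the parameters).
BASE = "https://www.thekennelclub.org.uk/search/find-a-judge/"
url_1 = BASE + "?KeywordSearch=&Breed=&SelectedPanelAFieldTrials=Retriever&PanelA=True&Distance=15"
url_2 = BASE + "?KeywordSearch=&Breed=&SelectedPanelAFieldTrials=Spaniel&PanelA=True&Distance=15"

diff = differing_params(url_1, url_2)
print(diff)  # {'SelectedPanelAFieldTrials': (['Retriever'], ['Spaniel'])}
```

With these inputs the only differing parameter is SelectedPanelAFieldTrials, which matches the Retriever/Spaniel difference pointed out above.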

Thanks for your help

@DaveJ19 The scraper traverses pages in pseudo-random order, so the order of records in the scraped data will not correspond to the order of the sitemap's start URLs and may change when new start URLs are added. Unfortunately, it is currently not possible to change this behavior.
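Since record order cannot be relied on, one way to confirm that every start URL contributed data is to group the exported rows by the start URL they came from. This is a rough sketch, assuming the CSV export contains a `web-scraper-start-url` column; check the header of your own export, as the column name is an assumption here. The function name `group_by_start_url` and the sample rows (example.com URLs, "Judge A" and so on) are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumption: the exported CSV has a column with this name.
START_URL_COLUMN = "web-scraper-start-url"


def group_by_start_url(csv_text, column=START_URL_COLUMN):
    """Group exported rows by the start URL they were scraped from."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[column]].append(row)
    return dict(groups)


# Made-up sample export; rows are deliberately not in start URL order.
sample = """web-scraper-start-url,name
https://example.com/search?type=Spaniel,Judge B
https://example.com/search?type=Retriever,Judge A
https://example.com/search?type=Spaniel,Judge C
"""

groups = group_by_start_url(sample)
for url, rows in groups.items():
    print(url, len(rows))
```

If a start URL is missing from the groups, no records were saved for it, regardless of the order in which the rows appear.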