Multiple Start URLs

DaveJ19 · April 12, 2023, 9:01pm

How does this work? I have defined two start URLs in the sitemap metadata. When I save this, I am presented at _root with the option to 'Add new selector'.

How are the multiple start URLs iterated? In my situation, the start URLs are links to the result pages returned by different search criteria. I need the same scrape from each.

At present, when I try to save the results, I only get the data from one of the start URLs.

ViestursWS · April 13, 2023, 1:35pm

@DaveJ19 Hello, multiple start URLs for a sitemap can be added via the UI of Web Scraper Cloud (handles up to 20,000 start URLs).

DaveJ19 · April 13, 2023, 1:48pm

I don't use Cloud. The Chrome plugin allows me to create multiple start URLs, but I don't understand how they are used, as my output only ever comes from one of the start URLs.

ViestursWS · April 13, 2023, 1:50pm

@DaveJ19 Could you please provide the sitemap's JSON?

DaveJ19 · April 13, 2023, 2:46pm

Here's the JSON

{"_id":"test","startUrl":["The Kennel Club card","multiple":true,"parentSelectors":["_root"],"selector":"a.m-judge-card__link","type":"SelectorLink"},{"id":"name","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"phone","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Phone') + dd a","type":"SelectorText"},{"id":"address","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Address') + dd","type":"SelectorText"},{"id":"judging section","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"div.m-tabs__panel","type":"SelectorText"},{"id":"contact link","multiple":false,"parentSelectors":["judge card"],"selector":"dt:contains('Email') + dd a","type":"SelectorLink"},{"id":"judges id","multiple":false,"parentSelectors":["judge card"],"regex":"","selector":"dt:contains('Field Trial judge ID') + dd","type":"SelectorText"}]}

I only seem to get the information from the last start URL.

ViestursWS · April 14, 2023, 10:07am

@DaveJ19 Hi, please apply the preformatted text option after pasting the sitemap; otherwise the JSON is invalid due to autoformatting.
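
For anyone who wants to check a pasted sitemap before posting it, here is a minimal Python sketch. It only tests that the text parses as JSON and has the two top-level keys seen in this thread (`startUrl` and `selectors`). The function name `check_sitemap` and both sample strings are made up for illustration; this is not how Web Scraper itself validates an import.

```python
import json


def check_sitemap(text: str) -> str:
    """Return a short verdict on whether pasted sitemap text parses as JSON."""
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at character {err.pos}: {err.msg}"
    if not isinstance(sitemap, dict):
        return "valid JSON, but not an object"
    # "startUrl" and "selectors" are the top-level keys used by the sitemap in this thread.
    missing = [key for key in ("startUrl", "selectors") if key not in sitemap]
    if missing:
        return "valid JSON, but missing keys: " + ", ".join(missing)
    return "ok"


# Shortened, hypothetical sitemap texts for illustration only.
good = '{"_id":"test","startUrl":["https://example.com/a"],"selectors":[]}'
mangled = '{"_id":"test","startUrl":["Example page","multiple":true}'

print(check_sitemap(good))
print(check_sitemap(mangled))
```

The second string mimics what autoformatting does to a paste: a URL is replaced by a page title and part of the structure is lost, so the parser fails partway through.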

DaveJ19 · April 14, 2023, 11:06am

{"_id":"test","startUrl":["https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Retriever&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=","https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Spaniel&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups="],"selectors":[{"id":"judge card","parentSelectors":["_root"],"type":"SelectorLink","selector":"a.m-judge-card__link","multiple":true},{"id":"name","parentSelectors":["judge card"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"phone","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Phone') + dd a","multiple":false,"regex":""},{"id":"address","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Address') + dd","multiple":false,"regex":""},{"id":"judging section","parentSelectors":["judge card"],"type":"SelectorText","selector":"div.m-tabs__panel","multiple":false,"regex":""},{"id":"contact link","parentSelectors":["judge card"],"type":"SelectorLink","selector":"dt:contains('Email') + dd a","multiple":false},{"id":"judges id","parentSelectors":["judge card"],"type":"SelectorText","selector":"dt:contains('Field Trial judge ID') + dd","multiple":false,"regex":""}]}

ViestursWS · April 14, 2023, 11:56am

@DaveJ19 Hi, it appears the start URLs are identical. Please, note that Web Scraper does not visit the same link twice and any duplicate links are automatically discarded.
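
To illustrate the "never visit the same link twice" rule, here is a rough Python sketch of a crawl queue with a visited set. It assumes duplicates are detected by exact string comparison, which this thread does not confirm, and it is not Web Scraper's actual code. The `example.com` URLs and the `crawl_order` name are hypothetical.

```python
from collections import deque


def crawl_order(start_urls):
    """Return the URLs a crawler would visit, skipping exact duplicates."""
    queue = deque(start_urls)
    visited = set()
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            # Already seen this exact URL string, so discard it.
            continue
        visited.add(url)
        order.append(url)
    return order


# Hypothetical URLs that differ only in one query parameter value.
base = "https://example.com/search/?SelectedPanelAFieldTrials="
urls = [base + "Retriever", base + "Spaniel", base + "Retriever"]

print(crawl_order(urls))
```

Under this assumption, only the repeated third entry is dropped. Two URLs that differ in a single query value are still treated as separate pages.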

DaveJ19 · April 14, 2023, 12:13pm

OK, it seems very odd.

URL 1 = https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Retriever&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=
URL 2 = https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=Spaniel&SelectedPanelBFieldTrials=&SelectedJcfLevels=&SelectedSearchOptions=&SelectedSearchOptionsNotActivity=Field+trials&Championship=False&NonChampionship=False&PanelA=True&PanelB=False&Location=&Distance=15&TotalResults=0&Sort=&SearchProfile=True&SelectedBestInBreedGroups=&SelectedBestInSubGroups=

When I paste each one into a web browser, I get different results back based on the search criteria of the web page.

I don't understand how this URL is being handled. Which part of the URL gets checked? These two URLs are identical up to the following point and differ after it:

 https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed=&SelectedChampionshipActivities=&SelectedNonChampionshipActivities=&SelectedPanelAFieldTrials=

The first URL then has "Retriever&SelectedPanelBFieldTrials......etc"

The second URL then has "Spaniel&SelectedPanelBFieldTrials....... etc"
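
A quick way to confirm where two long search URLs differ is to compare their query parameters rather than reading them by eye. The Python sketch below does that with the standard library. The two URLs are abbreviated forms of URL 1 and URL 2 above (only a few of the parameters are kept), and the helper name `differing_params` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit


def differing_params(url_a: str, url_b: str) -> dict:
    """Map each query parameter whose values differ to its (a, b) value pair."""
    query_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    query_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        name: (query_a.get(name), query_b.get(name))
        for name in sorted(query_a.keys() | query_b.keys())
        if query_a.get(name) != query_b.get(name)
    }


# Abbreviated versions of the two start URLs; most empty parameters are omitted.
prefix = "https://www.thekennelclub.org.uk/search/find-a-judge/?KeywordSearch=&Breed="
url_1 = prefix + "&SelectedPanelAFieldTrials=Retriever&SelectedPanelBFieldTrials=&PanelA=True"
url_2 = prefix + "&SelectedPanelAFieldTrials=Spaniel&SelectedPanelBFieldTrials=&PanelA=True"

print(differing_params(url_1, url_2))
# → {'SelectedPanelAFieldTrials': (['Retriever'], ['Spaniel'])}
```

Running the same function on the full URLs should report the same single parameter, since `SelectedPanelAFieldTrials` is the only place where they differ.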

Thanks for your help

ViestursWS · April 14, 2023, 12:23pm

@DaveJ19 The scraper traverses pages in pseudo-random order. The order of records in the scraped data will not correspond to the order of start URLs for the sitemap and may change when new start URLs are added. Unfortunately, it is currently not possible to change this behavior.
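
Since the record order cannot be relied on, one workaround is to regroup the exported rows by the start URL they came from. The Python sketch below assumes the CSV export has a `web-scraper-start-url` column; please check the header row of your own export, as this thread does not confirm the column name. The sample rows, judge names, and `example.com` URLs are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_start_url(csv_text: str, column: str = "web-scraper-start-url") -> dict:
    """Group exported rows by the start URL recorded in the given column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[column]].append(row)
    return dict(groups)


# Invented sample export; the column names are an assumption about the export format.
sample = (
    "web-scraper-order,web-scraper-start-url,name\n"
    "1-3,https://example.com/?type=Spaniel,Judge C\n"
    "1-1,https://example.com/?type=Retriever,Judge A\n"
    "1-2,https://example.com/?type=Spaniel,Judge B\n"
)

for start_url, rows in group_by_start_url(sample).items():
    print(start_url, [row["name"] for row in rows])
```

Grouping this way also gives a quick check on whether every start URL contributed records, which is the original problem in this thread.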