Zomato links + company details

Hi,
I'm trying to scrape the Zomato links on this page and then get the details from each and every link.

this is the sitemap

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":".sc-eDZMvD","multiple":true,"delay":"3000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-iYUSvU","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.col-md-8","multiple":true,"delay":0},{"id":"website","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"li:nth-of-type(1) a","multiple":false,"extractAttribute":"href","delay":0},{"id":"linkedin","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"a.in","multiple":false,"extractAttribute":"href","delay":0},{"id":"companyname","type":"SelectorText","parentSelectors":["cards"],"selector":"p","multiple":false,"regex":"","delay":0}]}

what am i missing?
thanks!

I edited the sitemap:

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div.sc-keIums","multiple":true,"delay":"3000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-iYUSvU","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.col-md-8","multiple":true,"delay":0},{"id":"website","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"li:nth-of-type(1) a","multiple":false,"extractAttribute":"href","delay":0},{"id":"linkedin","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"a.in","multiple":false,"extractAttribute":"href","delay":0},{"id":"companyname","type":"SelectorText","parentSelectors":["cards"],"selector":"p","multiple":false,"regex":"","delay":0},{"id":"clicklinks","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.sc-bnXvFD","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"h1.sc-dlyikq","multiple":false,"regex":"","delay":0},{"id":"category","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"#root > div > main > div > section.sc-hZeNU.bqUezT > section > section.sc-bYTsla.jfhrNF > section.sc-fFTYTi.bdVggf > div","multiple":false,"regex":"","delay":0},{"id":"city","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"a.sc-gFXMyG","multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"p.kKemRh","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"p.clKRrC","multiple":false,"regex":"","delay":0}]}

However, I only get a few results, and they are all mixed up.
Any ideas?
thanks!
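
When results come out mixed up, it can help to look at how the selectors are nested. The snippet below is only a rough sketch: `selector_tree` is a made-up helper (not part of Web Scraper), and the sitemap is a trimmed copy of the edited one above, keeping just the `id`, `type` and `parentSelectors` fields. It shows that `clicklinks` sits directly under `_root`, next to the scrolling `cards` selector rather than inside it, which may be worth checking.

```python
import json
from collections import defaultdict


def selector_tree(sitemap: dict) -> list[str]:
    """Return indented lines showing which selector is nested under which."""
    children = defaultdict(list)
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            children[parent].append(sel)

    lines = []

    def walk(parent_id, depth, path):
        for sel in children.get(parent_id, []):
            if sel["id"] in path:  # a selector can be its own parent; avoid looping
                continue
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            walk(sel["id"], depth + 1, path | {sel["id"]})

    walk("_root", 0, frozenset())
    return lines


# Trimmed copy of the edited sitemap (assumption: only the fields needed here).
sitemap = json.loads("""
{"_id":"zomato","selectors":[
 {"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"]},
 {"id":"companylink","type":"SelectorLink","parentSelectors":["cards"]},
 {"id":"clicklinks","type":"SelectorLink","parentSelectors":["_root"]},
 {"id":"name","type":"SelectorText","parentSelectors":["clicklinks"]},
 {"id":"phone","type":"SelectorText","parentSelectors":["clicklinks"]}
]}
""")

tree = selector_tree(sitemap)
print("\n".join(tree))
```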

Hi, I took a look at your sitemap, and some things are not defined correctly. Also, could you take a screenshot showing exactly where you took the website and LinkedIn links from?

Hi viesturs,
This is the page I'm trying to scrape:
https://www.zomato.com/grande-lisboa/dine-out-in-ericeira
I need all the links (infinite scroll), and then to click on each and every link to grab the details (phone numbers, addresses, etc.).
The "linkedin" id was just a name I added; it has nothing to do with linkedin.com.
Thanks!

I tried scrolling manually to the very bottom of the page to see where the scroll selector would stop, but the page told me several times that it had failed to load more content. So the first issue is that Web Scraper can't finish the first scraping phase, which is scrolling down to the last item. At the moment, that is a problem with the page itself. I also made a few small adjustments to your sitemap, though.

My version:
{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div[class*="jumbo-tracker"]","multiple":true,"delay":2000},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-dyfhso","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Call") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Direction") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains("Average Cost") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="bObnWx"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="kMAHHC"]","multiple":false,"regex":"","delay":0}]}

I think that increasing the scroll delay from 2000 to 5000 or higher might solve the issue. Either way, the most important thing is for the scroll to reach the very last item.
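
If you'd rather not edit the delay by hand, here is a minimal sketch of doing it in Python. It assumes the sitemap is already loaded as a dict; `set_scroll_delay` is a hypothetical helper, and the sitemap below is shortened to two selectors from my version for illustration. `json.dumps` writes the result back out with any quotes inside selectors escaped.

```python
import copy
import json


def set_scroll_delay(sitemap: dict, delay_ms: int) -> dict:
    """Return a copy of the sitemap with every scroll selector's delay changed."""
    updated = copy.deepcopy(sitemap)  # leave the original sitemap untouched
    for sel in updated["selectors"]:
        if sel["type"] == "SelectorElementScroll":
            sel["delay"] = delay_ms
    return updated


# Shortened sitemap (assumption: only two selectors kept for the example).
sitemap = {
    "_id": "zomato",
    "startUrl": ["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],
    "selectors": [
        {"id": "cards", "type": "SelectorElementScroll",
         "parentSelectors": ["_root"],
         "selector": 'div[class*="jumbo-tracker"]',
         "multiple": True, "delay": 2000},
        {"id": "companylink", "type": "SelectorLink",
         "parentSelectors": ["cards"], "selector": "a.sc-dyfhso",
         "multiple": True, "delay": 0},
    ],
}

slower = set_scroll_delay(sitemap, 5000)
exported = json.dumps(slower, separators=(",", ":"))
print(exported)
```

The printed string is what you would paste into the import dialog; the other selectors keep their own delays.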

EDIT:

{"_id":"zomato1","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div div div div div:has(.jumbo-tracker)","multiple":true,"delay":"500"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a:nth(0)","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Call") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Direction") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains("Average Cost") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="bObnWx"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="kMAHHC"]","multiple":false,"regex":"","delay":0}]}

*Item links weren't functional

Hi @ViestursWS!
Thanks a lot, I really appreciate it!
By the way, I tried to import both of your sitemaps and I get "invalid json":
https://monosnap.com/file/ibceWtTNdXLn7hT71gVAKx2WbzGIms
Maybe a typo?
Thanks!

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div div div div div:has(.jumbo-tracker)","multiple":true,"delay":"5000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a:nth(0)","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains(\"Call\") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains(\"Direction\") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains(\"Average Cost\") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width=\"max-content\"] p[class*=\"bObnWx\"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width=\"max-content\"] p[class*=\"kMAHHC\"]","multiple":false,"regex":"","delay":0}]}

Hi! Try copying it now, starting from the "{". I guess that when I copy and paste it in the regular way, it adds some unnecessary spaces, which is why it shows up as invalid afterwards, so I had to use the preformatted text option. Hope it works now!

Yayyyy! You rock, viesturs!
It works like a charm! :rocket:
Thanks again!

You're welcome! :smiley: If this validation error comes up again, I recommend checking the sitemap with the JSONLint validator: https://jsonlint.com/
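
The same kind of check can also be done locally. This is just a small sketch using Python's standard `json` module; `check_json` is a made-up helper name, and the two strings are cut-down examples of a selector as it appeared in the sitemaps that failed to import (quotes inside the selector not escaped) and in the one that worked (quotes escaped as `\"`).

```python
import json


def check_json(text: str) -> str:
    """Report whether the text parses as JSON, and where parsing fails if not."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    return "valid JSON"


# Raw strings, so the backslashes reach the JSON parser exactly as written.
broken = r'{"selector":"h5:contains("Call") + p"}'
fixed = r'{"selector":"h5:contains(\"Call\") + p"}'

print(check_json(broken))
print(check_json(fixed))
```

The reported column points at roughly where the parser gave up, which is usually close to the first unescaped quote.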

thanks for the suggestion! really appreciate it!