Zomato links + company details

eldoland · February 22, 2021, 5:18pm

Hi,
I'm trying to scrape the Zomato links on this page and then get all the details behind each link.

This is the sitemap:

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":".sc-eDZMvD","multiple":true,"delay":"3000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-iYUSvU","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.col-md-8","multiple":true,"delay":0},{"id":"website","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"li:nth-of-type(1) a","multiple":false,"extractAttribute":"href","delay":0},{"id":"linkedin","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"a.in","multiple":false,"extractAttribute":"href","delay":0},{"id":"companyname","type":"SelectorText","parentSelectors":["cards"],"selector":"p","multiple":false,"regex":"","delay":0}]}

What am I missing?
thanks!

eldoland · February 23, 2021, 7:49am

I edited the sitemap:

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div.sc-keIums","multiple":true,"delay":"3000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-iYUSvU","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.col-md-8","multiple":true,"delay":0},{"id":"website","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"li:nth-of-type(1) a","multiple":false,"extractAttribute":"href","delay":0},{"id":"linkedin","type":"SelectorElementAttribute","parentSelectors":["elements"],"selector":"a.in","multiple":false,"extractAttribute":"href","delay":0},{"id":"companyname","type":"SelectorText","parentSelectors":["cards"],"selector":"p","multiple":false,"regex":"","delay":0},{"id":"clicklinks","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.sc-bnXvFD","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"h1.sc-dlyikq","multiple":false,"regex":"","delay":0},{"id":"category","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"#root > div > main > div > section.sc-hZeNU.bqUezT > section > section.sc-bYTsla.jfhrNF > section.sc-fFTYTi.bdVggf > div","multiple":false,"regex":"","delay":0},{"id":"city","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"a.sc-gFXMyG","multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"p.kKemRh","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["clicklinks"],"selector":"p.clKRrC","multiple":false,"regex":"","delay":0}]}

However, I only get a few results and they are all mixed up.
Any ideas?
thanks!

ViestursWS · February 23, 2021, 1:41pm

Hi, I took a look at your sitemap and some things are not defined correctly. Also, could you take a screenshot showing exactly where you took the website and LinkedIn links from?

eldoland · February 23, 2021, 6:31pm

Hi viesturs,
This is the page I'm trying to scrape:
https://www.zomato.com/grande-lisboa/dine-out-in-ericeira
I need all the links (the page uses infinite scroll), and then I need to click on each link to grab the details (phone numbers, addresses, etc.).
The "linkedin" id was just a name I added; it has nothing to do with linkedin.com.
thanks!

ViestursWS · February 24, 2021, 12:57pm

I tried scrolling to the very bottom of the page manually so that the scroll selector would reach the end, but the page told me multiple times that it had failed to load more content. So the first issue is that Web Scraper can't finish the first scraping phase, which is scrolling down to the last item. At the moment that's a problem with the page itself. I also made a few small adjustments to your sitemap.

My version:
{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div[class*="jumbo-tracker"]","multiple":true,"delay":2000},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a.sc-dyfhso","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Call") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Direction") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains("Average Cost") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="bObnWx"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="kMAHHC"]","multiple":false,"regex":"","delay":0}]}

ViestursWS · February 24, 2021, 1:00pm

I think increasing the scroll delay from 2000 to 5000 or higher might solve the issue. Either way, the most important thing is to reach the very last item.
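
For anyone who would rather change the delay in an exported sitemap than in the Web Scraper UI, here is a minimal Python sketch that sets the delay on every `SelectorElementScroll` selector. The function name `set_scroll_delay` and the trimmed-down example sitemap are assumptions made up for illustration; only the field names (`selectors`, `type`, `delay`) come from the sitemaps in this thread.

```python
import json


def set_scroll_delay(sitemap_json: str, delay_ms: int) -> str:
    """Return the sitemap JSON with every scroll selector's delay set to delay_ms."""
    sitemap = json.loads(sitemap_json)
    for selector in sitemap.get("selectors", []):
        # Only touch the infinite-scroll selectors; leave the others alone.
        if selector.get("type") == "SelectorElementScroll":
            selector["delay"] = delay_ms
    # Compact separators keep the single-line style used for sitemap exports.
    return json.dumps(sitemap, separators=(",", ":"))


# Trimmed-down sitemap (hypothetical, for illustration only).
example = (
    '{"_id":"zomato","selectors":['
    '{"id":"cards","type":"SelectorElementScroll","delay":2000},'
    '{"id":"companylink","type":"SelectorLink","delay":0}]}'
)
updated = set_scroll_delay(example, 5000)
print(updated)
```

The output can then be pasted into the sitemap import dialog. Whether 5000 ms is enough depends on how quickly the page loads more items, so treat the value as something to experiment with.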

ViestursWS · February 24, 2021, 1:42pm

EDIT:

{"_id":"zomato1","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div div div div div:has(.jumbo-tracker)","multiple":true,"delay":"500"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a:nth(0)","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Call") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains("Direction") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains("Average Cost") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="bObnWx"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width="max-content"] p[class*="kMAHHC"]","multiple":false,"regex":"","delay":0}]}

* The item links in the previous version weren't functional.

eldoland · February 25, 2021, 8:32am

Hi @ViestursWS !
thanks a lot, really appreciate it!
By the way, I tried to import both of your sitemaps and I get "invalid json":
https://monosnap.com/file/ibceWtTNdXLn7hT71gVAKx2WbzGIms
Maybe there's a typo somewhere?
thanks!

ViestursWS · February 25, 2021, 10:25am

{"_id":"zomato","startUrl":["https://www.zomato.com/grande-lisboa/dine-out-in-ericeira"],"selectors":[{"id":"cards","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div div div div div:has(.jumbo-tracker)","multiple":true,"delay":"5000"},{"id":"companylink","type":"SelectorLink","parentSelectors":["cards"],"selector":"a:nth(0)","multiple":true,"delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["companylink"],"selector":"body:has(div#root h1)","multiple":true,"delay":0},{"id":"place-title","type":"SelectorText","parentSelectors":["element-card"],"selector":"div#root h1:nth(1)","multiple":false,"regex":"","delay":0},{"id":"phone-number","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains(\"Call\") + p","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["element-card"],"selector":"h5:contains(\"Direction\") + div + p","multiple":false,"regex":"","delay":0},{"id":"average-cost","type":"SelectorText","parentSelectors":["element-card"],"selector":"h3:contains(\"Average Cost\") + p","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width=\"max-content\"] p[class*=\"bObnWx\"]","multiple":false,"regex":"","delay":0},{"id":"reviews","type":"SelectorText","parentSelectors":["element-card"],"selector":"section[width=\"max-content\"] p[class*=\"kMAHHC\"]","multiple":false,"regex":"","delay":0}]}

ViestursWS · February 25, 2021, 10:48am

Hi! Try copying it now, starting from the "{". I guess that when I copy and paste it the regular way, it adds some unnecessary spaces, which is why it then shows up as invalid, so I had to use the preformatted text option. Hope it works now!
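
A side note for later readers: comparing the preformatted sitemap with the earlier pastes in this thread, the visible difference is that the inner double quotes in selectors such as `h5:contains("Call") + p` are backslash-escaped in the working version and bare in the earlier ones. The Python sketch below illustrates why that matters. The selector values are taken from the sitemap above, while the variable names are just for illustration.

```python
import json

# One selector entry from the sitemap, written as a Python dict.
selector = {
    "id": "phone-number",
    "type": "SelectorText",
    "parentSelectors": ["element-card"],
    "selector": 'h5:contains("Call") + p',
    "multiple": False,
    "regex": "",
    "delay": 0,
}

# json.dumps escapes the inner double quotes as \" automatically.
escaped = json.dumps(selector, separators=(",", ":"))
print(escaped)

# Simulate a paste that lost its backslashes.
broken = escaped.replace('\\"', '"')
try:
    json.loads(broken)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)
```

Generating the sitemap text with a JSON serializer, rather than typing the quotes by hand, is one way to avoid this class of import error.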

eldoland · February 25, 2021, 11:11am

Yayyyy! You rock, Viesturs!
It works like a charm!
thanks again!

ViestursWS · February 25, 2021, 3:19pm

You're welcome! If this validation error comes up again in the future, I recommend taking a look at the JSONLint validator: https://jsonlint.com/
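
If you'd rather check a sitemap offline, a rough equivalent can be sketched with Python's standard `json` module. The helper name `find_json_error` and the two sample strings are hypothetical; this is just one way to locate the spot where parsing fails, not something Web Scraper itself provides.

```python
import json
from typing import Optional


def find_json_error(text: str, context: int = 20) -> Optional[str]:
    """Return None if text is valid JSON, else a message showing where parsing failed."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset where the parser gave up.
        start = max(err.pos - context, 0)
        snippet = text[start:err.pos + context]
        return f"{err.msg} at line {err.lineno}, column {err.colno}: ...{snippet}..."
    return None


# Hypothetical samples: one valid, one with unescaped inner quotes.
good = '{"_id":"zomato","selectors":[]}'
bad = '{"_id":"zomato","selector":"h5:contains("Call") + p"}'

print(find_json_error(good))
print(find_json_error(bad))
```

The snippet in the message shows the text around the failure point, which makes it easier to spot the problem in a long single-line sitemap.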

eldoland · February 26, 2021, 4:08pm

thanks for the suggestion! really appreciate it!