A simple question about scraping

Hello everybody!

I'm new to scraping websites, and I'm making some mistakes while creating a rule to scrape a coffee shop website. Could anyone help me identify the mistake and show me how to create the correct rule?

The URL is: http://guiadecafeterias.com.br/cafeterias?state=&city=&district=&search=

First, I want to scrape all 16 pages of coffee shops, and then some specific attributes for each one: coffee shop name, what they serve (coffee methods), address, website, phone, farm, etc. I've attached the selector graph that I made to this topic.

[Image: tree map coffee shop]

Can somebody help me?

Many thanks, :slight_smile:

Matheus

Hi,

Don't post the selector graph.

Post the JSON code instead (use Export Sitemap).

Thanks

Hi @chefas!

Ok. The sitemap is:

{"_id":"nome_cafeterias","startUrl":["http://guiadecafeterias.com.br/cafeterias/p/[2-16]?state=&city=&district=&search="],"selectors":[{"id":"cafeteria","type":"SelectorElement","selector":"div.place-list:nth-of-type(2) div.overflow","parentSelectors":["_root"],"multiple":false,"delay":0},{"id":"fazenda","type":"SelectorText","selector":"div.informations p:nth-of-type(1)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"metodos","type":"SelectorText","selector":"div.informations p:nth-of-type(3)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"endereço","type":"SelectorText","selector":"div.services p:nth-of-type(1)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"telefone","type":"SelectorText","selector":"div.services p:nth-of-type(4)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"site","type":"SelectorLink","selector":"p a","parentSelectors":["cafeteria"],"multiple":false,"delay":0},{"id":"sobre_cafeteria","type":"SelectorText","selector":"div.resume","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0}]}

Thanks in advance :slight_smile:

Cheers

Hello,

Have a look at this.
Note the changes to the start URL in the sitemap metadata.
It seems to work. Please test it further and tell me how it goes.

{"_id":"test","startUrl":["http://guiadecafeterias.com.br/cafeterias/p/[1-16]?state=&city=&district=&search="],"selectors":[{"id":"cafeteria","type":"SelectorElement","selector":"section","parentSelectors":["_root"],"multiple":true,"delay":"3000"},{"id":"link","type":"SelectorLink","selector":"h3 a","parentSelectors":["cafeteria"],"multiple":true,"delay":"500"},{"id":"title","type":"SelectorText","selector":"div.heading h2","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","selector":"div.services p:nth-of-type(1)","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"tel","type":"SelectorText","selector":"div.services p:nth-of-type(4)","parentSelectors":["link"],"multiple":false,"regex":"","delay":0}]}
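
Compared with the first sitemap, the start URL range is now [1-16] instead of [2-16] (so page 1 is included), the "cafeteria" element selector has "multiple":true, and a link selector ("h3 a") opens each coffee shop page before the text selectors run. As far as I understand it, Web Scraper turns a [1-16] range in the start URL into one URL per page. Here is a rough Python sketch of that expansion, only to illustrate the idea. The function name expand_start_url is made up for this example, and this is not Web Scraper's own code:

```python
import re
from typing import List

# Matches a numeric range placeholder such as [1-16] or [001-100].
RANGE_RE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_start_url(url: str) -> List[str]:
    """Expand the first [start-end] placeholder into one URL per number.

    Hypothetical helper for illustration; it only mimics the range
    syntax used in the sitemap's startUrl.
    """
    match = RANGE_RE.search(url)
    if match is None:
        # No range in the URL: it is a single start page.
        return [url]
    start_text, end_text = match.group(1), match.group(2)
    # Pad to the width of the start number, so [001-100] keeps its zeros.
    width = len(start_text)
    prefix, suffix = url[:match.start()], url[match.end():]
    return [
        prefix + str(n).zfill(width) + suffix
        for n in range(int(start_text), int(end_text) + 1)
    ]


if __name__ == "__main__":
    pattern = (
        "http://guiadecafeterias.com.br/cafeterias/p/[1-16]"
        "?state=&city=&district=&search="
    )
    pages = expand_start_url(pattern)
    print(len(pages))   # number of listing pages that would be visited
    print(pages[0])     # first listing page
    print(pages[-1])    # last listing page
```

With the [2-16] range from the first sitemap, the same function returns 15 URLs and skips page 1, which is one reason results were missing.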

Hey @chefas,

Many thanks! I'll try it now and let you know whether it works.

:slight_smile:

Hi @chefas, it worked! Many thanks again :slight_smile:

OK, you are welcome, my friend.

Thanks!
I have another question, this time about a different website that requires a login. It's the HubSpot Academy website. I'm trying to collect lesson names and lesson links, but the sitemap is not working well.

The website is https://app.hubspot.com/learning-center/404934/lessons

The JSON is:

{"_id":"hubspot_academy","startUrl":["https://app.hubspot.com/learning-center/404934/lessons?page/=[1-11]&search="],"selectors":[{"id":"lesson_name","type":"SelectorText","selector":"a.private-link div.is--module","parentSelectors":["_root"],"multiple":true,"regex":"","delay":0},{"id":"lesson_link","type":"SelectorLink","selector":"div.lessons-page__content a.private-link","parentSelectors":["_root"],"multiple":true,"delay":0}]}

I'd appreciate it if you could help me :slight_smile:

Cheers

Hello,

I couldn't get a login after three attempts, so I gave up.