One simple question about scraping

mouraomatheus · May 5, 2018, 5:02pm

Hello everybody!

I'm new to scraping websites, and I'm making some mistakes while creating a rule to scrape a coffee shop website. Could anyone help me identify the mistake and show me how to create the correct rule?

The URL is: http://guiadecafeterias.com.br/cafeterias?state=&city=&district=&search=

First, I want to scrape all 16 pages of coffee shops, and then some specific attributes such as the coffee shop name, what they serve (coffee methods), address, website, phone, farm, etc. I've attached the selector graph that I made to this topic.

[Attached image: tree map coffee shop]

Can somebody help me?

Many thanks,

Matheus

chefas · May 6, 2018, 8:33am

Hi,

Don't post the selector graph.

Post the JSON code (= Export Sitemap).

Thanks

mouraomatheus · May 6, 2018, 8:03pm

Hi @chefas!

Ok. The sitemap is:

{"_id":"nome_cafeterias","startUrl":["http://guiadecafeterias.com.br/cafeterias/p/[2-16]?state=&city=&district=&search="],"selectors":[{"id":"cafeteria","type":"SelectorElement","selector":"div.place-list:nth-of-type(2) div.overflow","parentSelectors":["_root"],"multiple":false,"delay":0},{"id":"fazenda","type":"SelectorText","selector":"div.informations p:nth-of-type(1)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"metodos","type":"SelectorText","selector":"div.informations p:nth-of-type(3)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"endereço","type":"SelectorText","selector":"div.services p:nth-of-type(1)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"telefone","type":"SelectorText","selector":"div.services p:nth-of-type(4)","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0},{"id":"site","type":"SelectorLink","selector":"p a","parentSelectors":["cafeteria"],"multiple":false,"delay":0},{"id":"sobre_cafeteria","type":"SelectorText","selector":"div.resume","parentSelectors":["cafeteria"],"multiple":false,"regex":"","delay":0}]}
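For anyone reading the exported JSON instead of the selector graph, here is a minimal sketch that prints the selector hierarchy as indented text, together with each selector's `multiple` flag. The `selector_tree` helper is hypothetical (it is not part of Web Scraper), and the embedded JSON is a trimmed copy of the sitemap above that keeps only three of the seven selectors.

```python
import json

# Trimmed copy of the sitemap above: only three of the seven selectors are kept,
# and the "regex" and "delay" fields are dropped for brevity.
SITEMAP_JSON = """
{"_id":"nome_cafeterias",
 "selectors":[
  {"id":"cafeteria","type":"SelectorElement","selector":"div.place-list:nth-of-type(2) div.overflow","parentSelectors":["_root"],"multiple":false},
  {"id":"fazenda","type":"SelectorText","selector":"div.informations p:nth-of-type(1)","parentSelectors":["cafeteria"],"multiple":false},
  {"id":"site","type":"SelectorLink","selector":"p a","parentSelectors":["cafeteria"],"multiple":false}
 ]}
"""


def selector_tree(sitemap, parent="_root", depth=0):
    """Return indented lines describing every selector below `parent`.

    Assumes the parent/child links contain no cycles.
    """
    lines = []
    for sel in sitemap["selectors"]:
        if parent in sel["parentSelectors"]:
            label = f'{sel["id"]} ({sel["type"]}, multiple={sel["multiple"]})'
            lines.append("  " * depth + label)
            # Recurse into the selectors that name this one as their parent.
            lines.extend(selector_tree(sitemap, sel["id"], depth + 1))
    return lines


tree = selector_tree(json.loads(SITEMAP_JSON))
print("\n".join(tree))
```

Printing the tree this way makes it easy to see at a glance which selectors have `multiple` set to false.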

Thanks in advance

Cheers

chefas · May 7, 2018, 1:10pm

Hello,

Have a look at this.
Note the changes in the URL in the metadata.
It seems to work. You will have to test it more and let me know.

{"_id":"test","startUrl":["http://guiadecafeterias.com.br/cafeterias/p/[1-16]?state=&city=&district=&search="],"selectors":[{"id":"cafeteria","type":"SelectorElement","selector":"section","parentSelectors":["_root"],"multiple":true,"delay":"3000"},{"id":"link","type":"SelectorLink","selector":"h3 a","parentSelectors":["cafeteria"],"multiple":true,"delay":"500"},{"id":"title","type":"SelectorText","selector":"div.heading h2","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","selector":"div.services p:nth-of-type(1)","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"tel","type":"SelectorText","selector":"div.services p:nth-of-type(4)","parentSelectors":["link"],"multiple":false,"regex":"","delay":0}]}
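To make the `[1-16]` part of the start URL concrete, here is a rough sketch of how such a range can be expanded into one URL per page. This is only an illustration and not Web Scraper's own code; `expand_start_url` is a hypothetical helper, and it handles only a plain `[start-end]` range with no padding or step options.

```python
import re

# Matches a numeric range placeholder such as [1-16] inside a start URL.
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)\]")


def expand_start_url(url):
    """Expand the first [start-end] placeholder into one URL per page number."""
    match = RANGE_PATTERN.search(url)
    if match is None:
        # No range placeholder: the URL is used as-is.
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    return [
        url[:match.start()] + str(page) + url[match.end():]
        for page in range(start, end + 1)
    ]


start_url = "http://guiadecafeterias.com.br/cafeterias/p/[1-16]?state=&city=&district=&search="
pages = expand_start_url(start_url)
print(len(pages))   # number of listing pages
print(pages[0])     # first listing page
print(pages[-1])    # last listing page
```

With `[2-16]` the same helper returns 15 URLs and skips page 1, which is the difference between the two start URLs in this thread.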

mouraomatheus · May 7, 2018, 2:37pm

Hey @chefas,

Many thanks! I'll try it now and let you know if it worked.

mouraomatheus · May 10, 2018, 5:03pm

Hi @chefas, it worked! Many thanks again.

chefas · May 10, 2018, 5:09pm

OK, you are welcome, my friend.

mouraomatheus · May 10, 2018, 5:13pm

Thanks!
I have another question, but about a different website that requires a login. It's the HubSpot Academy website, and I'm trying to collect lesson names and lesson links, but the sitemap is not working well.

The website is https://app.hubspot.com/learning-center/404934/lessons

The JSON is:

{"_id":"hubspot_academy","startUrl":["https://app.hubspot.com/learning-center/404934/lessons?page/=[1-11]&search="],"selectors":[{"id":"lesson_name","type":"SelectorText","selector":"a.private-link div.is--module","parentSelectors":["_root"],"multiple":true,"regex":"","delay":0},{"id":"lesson_link","type":"SelectorLink","selector":"div.lessons-page__content a.private-link","parentSelectors":["_root"],"multiple":true,"delay":0}]}

I'd be grateful if you could help me.

Cheers

chefas · May 11, 2018, 11:57am

Hello

It was impossible to log in after 3 attempts, so I give up.