How to scrape the phone, email, and location from a business listing

Hello all. I am new here and having some difficulty with my first real web scrape. I am trying to click into each business listing and scrape the business name, location, phone number, and email address. I was confident I was doing this correctly, but only the business name is being scraped, not the location, phone number, or email address. Would someone mind helping out the new guy?! Thank you!

Url: Recently Featured Firms - Architizer

Sitemap:
{"_id":"Architizer","startUrl":["Recently Featured Firms - Architizer Link","parentSelectors":["pagination"],"type":"SelectorLink","selector":"a.fw-medium","multiple":true,"linkType":"linkFromHref"},{"id":"Business Name","parentSelectors":["Page Link"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"Location","parentSelectors":["Page Link"],"type":"SelectorText","selector":"[data-id='296545'] span.placeholder","multiple":false,"regex":""},{"id":"Phone","parentSelectors":["Page Link"],"type":"SelectorText","selector":"#farmer-payne-architects-phone_numbers span.placeholder","multiple":false,"regex":""},{"id":"Email","parentSelectors":["Page Link"],"type":"SelectorText","selector":"#farmer-payne-architects-email_addresses div.control","multiple":false,"regex":""}]}

You can try this one:

{"_id":"Architizer","startUrl":["https://architizer.com/firms/project-type=Private%20House/firm-location=United%20States"],"selectors":[{"id":"Link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"a.fw-medium","type":"SelectorLink"},{"id":"Business Name","multiple":false,"parentSelectors":["Link"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"Ema","multiple":true,"parentSelectors":["Link"],"selector":"div.meta-card","type":"SelectorElement"},{"id":"Phone","multiple":false,"parentSelectors":["Link"],"regex":"","selector":".blank-ui .placeholder.single-line:contains(work:)","type":"SelectorText"},{"id":"Location","multiple":false,"parentSelectors":["Link"],"regex":"","selector":"span.js-rendered-content","type":"SelectorText"},{"id":"Email","multiple":true,"parentSelectors":["Ema"],"regex":"","selector":"[href^=\"mailto\"]","type":"SelectorText"}]}

@Jakirhasan Thanks a lot for your help. This worked for the first page. Would you mind helping me understand how to get the pagination working properly now? There is a "load more" button that loads more business listings without changing the URL. I would like to keep loading more listings and scrape the same information you got for the first page. Incorporating the pagination properly is causing me some problems. I really appreciate your help here!

Edit: I have watched the pagination tutorials, but am still running into a roadblock for some reason.

Update: I was able to get pagination to work, which is great, so thank you again for your help. I am now running into a bot filtering issue, and as a result 500 of the 1500 page links were not scraped. Do you have any recommendations on how to get around this? I am currently using the free Chrome extension. @Jakirhasan

You can try changing the request interval to 10000 ms or more and the page load delay to 4000 ms or more.
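Before starting a run with a longer interval, it may help to estimate how long it will take. Here is a rough sketch in Python. It assumes the request interval acts as a minimum gap between page requests, so the result is only a floor; `min_scrape_seconds` is a made-up helper name, not part of Web Scraper.

```python
def min_scrape_seconds(pages: int, request_interval_ms: int) -> float:
    """Rough floor on run time, assuming one page request per interval.

    Assumption: real runs take longer because page loading and the
    page load delay add time on top of the interval.
    """
    return pages * request_interval_ms / 1000


# 1500 page links (the count mentioned in the thread) at a 10000 ms interval.
seconds = min_scrape_seconds(pages=1500, request_interval_ms=10000)
print(f"at least {seconds:.0f} s, roughly {seconds / 3600:.1f} h")
```

So slowing down enough to avoid the filtering likely means leaving the browser running for several hours.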

{"_id":"Architizer","startUrl":["https://architizer.com/firms/project-type=Private%20House/firm-location=United%20States"],"selectors":[{"id":"Link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root","Page"],"selector":"a.fw-medium","type":"SelectorLink"},{"id":"Business Name","multiple":false,"parentSelectors":["Link"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"Ema","multiple":true,"parentSelectors":["Link"],"selector":"div.meta-card","type":"SelectorElement"},{"id":"Phone","multiple":false,"parentSelectors":["Link"],"regex":"","selector":".blank-ui .placeholder.single-line:contains(work:)","type":"SelectorText"},{"id":"Location","multiple":false,"parentSelectors":["Link"],"regex":"","selector":"span.js-rendered-content","type":"SelectorText"},{"id":"Email","multiple":true,"parentSelectors":["Ema"],"regex":"","selector":"[href^=\"mailto\"]","type":"SelectorText"},{"id":"Page","paginationType":"clickMore","parentSelectors":["_root","Page"],"selector":"button.button","type":"SelectorPagination"}]}
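If it helps to see how the selectors in this sitemap hang together, here is a small Python sketch that reads the `parentSelectors` fields and prints which selectors sit under which parent. The JSON in it is a trimmed copy of the sitemap above (only `id`, `type`, and `parentSelectors` are kept), and the helper names `children_by_parent` and `dangling_parents` are made up for this illustration; they are not part of Web Scraper.

```python
import json
from collections import defaultdict

# Trimmed copy of the sitemap above: CSS selectors and other fields are left out.
SITEMAP_JSON = """
{"_id": "Architizer",
 "selectors": [
  {"id": "Link", "type": "SelectorLink", "parentSelectors": ["_root", "Page"]},
  {"id": "Business Name", "type": "SelectorText", "parentSelectors": ["Link"]},
  {"id": "Ema", "type": "SelectorElement", "parentSelectors": ["Link"]},
  {"id": "Phone", "type": "SelectorText", "parentSelectors": ["Link"]},
  {"id": "Location", "type": "SelectorText", "parentSelectors": ["Link"]},
  {"id": "Email", "type": "SelectorText", "parentSelectors": ["Ema"]},
  {"id": "Page", "type": "SelectorPagination", "parentSelectors": ["_root", "Page"]}
 ]}
"""


def children_by_parent(sitemap: dict) -> dict:
    """Map each parent id to the ids of the selectors listed under it."""
    children = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            children[parent].append(selector["id"])
    return dict(children)


def dangling_parents(sitemap: dict) -> list:
    """Return parent ids that no selector defines (other than _root)."""
    known = {"_root"} | {s["id"] for s in sitemap["selectors"]}
    used = {p for s in sitemap["selectors"] for p in s["parentSelectors"]}
    return sorted(used - known)


sitemap = json.loads(SITEMAP_JSON)
tree = children_by_parent(sitemap)
for parent, kids in tree.items():
    print(parent, "->", kids)
print("dangling parents:", dangling_parents(sitemap))
```

The thing to notice is that `Page` lists itself as a parent and `Link` lists both `_root` and `Page`. That is what lets the "load more" click repeat and the firm links be collected after each click. A parent id that no selector defines would show up in the last line of output.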

@Jakirhasan that did it! Thanks again for your assistance - it is much appreciated!
