How to scrape multiple items on a page when they are not in an element

simon_clarke · December 15, 2022, 11:01pm

I'm trying to work out how to scrape items from a page of cafe listings. There are h2 headings for the cafe name, and separate text for the address/postcode (zip code). These are not contained in separate elements (like the convenient laptop listings in the webscraper tutorial videos). They are just in a flow of html text in a div content container.

I can scrape all the cafe names, but when I try to include the postcode, the scraper either scrapes only the first postcode and repeats it for every cafe, or lists all the cafe names first in the CSV with all the postcodes below them. I can't get the two pieces of information into the same row of the CSV for each cafe. Help! What am I not understanding? I am using the Text selector, with "Multiple" checked.

URL: 20 Best Cafes In Bristol | Amber (https://amberstudent.com/blog/post/20-best-cafes-in-bristol)

Sitemap 1 – same postcode for all cafes:

{"_id":"bristol-cafes","startUrl":["https://amberstudent.com/blog/post/20-best-cafes-in-bristol"],"selectors":[{"id":"cafe-wrapper","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"div.div-block-17","type":"SelectorHTML"},{"id":"cafe-name","multiple":true,"parentSelectors":["cafe-wrapper"],"regex":"","selector":"h2 strong","type":"SelectorText"},{"id":"cafe-postcode","multiple":true,"parentSelectors":["cafe-wrapper"],"regex":"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s?[0-9][A-Za-z]{2})","selector":"p:nth-of-type(n+2) a:nth-of-type(1)","type":"SelectorText"}]}

Sitemap 2 – cafe names and postcodes do not line up on the same row:

{"_id":"bristol-cafes","startUrl":["https://amberstudent.com/blog/post/20-best-cafes-in-bristol"],"selectors":[{"id":"cafe-name","multiple":true,"parentSelectors":["_root"],"regex":"","selector":"h2 strong","type":"SelectorText"},{"id":"cafe-postcode","multiple":true,"parentSelectors":["_root"],"regex":"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s?[0-9][A-Za-z]{2})","selector":"p:nth-of-type(n+2) a:nth-of-type(1)","type":"SelectorText"}]}
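
For anyone following along, here is a minimal Python sketch of roughly what the regex field in the sitemaps above is meant to do: pull a UK-style postcode out of a longer address string. It assumes the scraper keeps the first match it finds, which may not be exactly how Web Scraper applies the pattern. The function name extract_postcode and the sample address are made up for illustration and are not taken from the page.

```python
import re

# The postcode pattern from the sitemaps, written as raw strings so that
# \s reaches the regex engine unchanged.
UK_POSTCODE = re.compile(
    r"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})"
)


def extract_postcode(text):
    """Return the first postcode-like substring, or None if there is none.

    Hypothetical helper: it only mimics a "keep the first match" regex field.
    """
    match = UK_POSTCODE.search(text)
    return match.group(0) if match else None


# Made-up sample address, not scraped from the real page.
sample_address = "12 Example Street, Bristol BS1 4DJ"
print(extract_postcode(sample_address))
```

Note that the regex only narrows the text of whatever element the selector has already picked, so it cannot fix the row alignment problem on its own.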

ViestursWS · December 19, 2022, 2:46pm

@simon_clarke Hi, it appears that the most viable way to achieve this will require using the 'Element' selector.

Here's an example:

{"_id":"bristol-cafes","startUrl":["https://amberstudent.com/blog/post/20-best-cafes-in-bristol"],"selectors":[{"id":"cafe-wrapper","multiple":true,"parentSelectors":["_root"],"selector":".rich-text-blog h2","type":"SelectorElement"},{"id":"cafe-name","multiple":false,"parentSelectors":["cafe-wrapper"],"regex":"","selector":"strong","type":"SelectorText"},{"id":"cafe-postcode","multiple":false,"parentSelectors":["cafe-wrapper"],"regex":"","selector":"+ p strong:contains(\"Location\") + a","type":"SelectorText"}]}

simon_clarke · December 19, 2022, 6:33pm

That is fabulous – thank you!

I need to work out exactly what the element is selecting – I tried an element selector at first, but I must have chosen the wrong element. But I'll try to replicate it myself.

simon_clarke · December 19, 2022, 6:42pm

Oh, that's weird – when I try to recreate the cafe-postcode selector, I get an error message: "Parent does not contain selected element". Can you tell me how you created that selector? It works – but I can't see how to recreate it!

ViestursWS · December 20, 2022, 3:27pm

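@simon_clarke Hi, because of how this page is structured, you first have to select all of the heading elements with the Element selector. The paragraph that follows each heading is its sibling, not its child, so you then add that next paragraph to the child selector with "+ p", which lets it be used as if it were the heading's child.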
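
To illustrate the idea outside the extension, here is a rough Python sketch that pairs each heading with the paragraph directly after it, using only the standard library. The sample markup, the class name HeadingSiblingPairer, and the function pair_headings are all hypothetical. It also simplifies the real selector: it takes the first link in the next paragraph instead of checking for the "Location" label.

```python
from html.parser import HTMLParser

# Made-up markup imitating the flat layout described in this thread:
# headings and paragraphs are siblings, with no wrapper element per cafe.
SAMPLE_HTML = """
<div class="rich-text-blog">
  <p>Intro paragraph.</p>
  <h2><strong>1. Cafe One</strong></h2>
  <p><strong>Location</strong>: <a href="#">1 Sample Road, Bristol BS1 1AA</a></p>
  <p>Some description.</p>
  <h2><strong>2. Cafe Two</strong></h2>
  <p><strong>Location</strong>: <a href="#">2 Sample Lane, Bristol BS2 2BB</a></p>
</div>
"""


class HeadingSiblingPairer(HTMLParser):
    """Pair each <h2> with the first <a> in the <p> immediately after it."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_h2 = False
        self.in_link = False
        self.awaiting_sibling = False  # True right after a closing </h2>
        self.in_sibling_p = False      # True inside the <p> following an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # Each heading starts a new row, like the Element selector does.
            self.in_h2 = True
            self.awaiting_sibling = False
            self.in_sibling_p = False
            self.rows.append({"cafe-name": "", "cafe-postcode": ""})
        elif self.awaiting_sibling:
            # Only the element directly after the heading counts ("+ p").
            self.in_sibling_p = (tag == "p")
            self.awaiting_sibling = False
        elif tag == "a" and self.in_sibling_p and not self.rows[-1]["cafe-postcode"]:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
            self.awaiting_sibling = True
        elif tag == "a":
            self.in_link = False
        elif tag == "p":
            self.in_sibling_p = False

    def handle_data(self, data):
        if self.in_h2:
            self.rows[-1]["cafe-name"] += data
        elif self.in_link:
            self.rows[-1]["cafe-postcode"] += data


def pair_headings(html):
    """Return one dict per heading, with the name and address on the same row."""
    parser = HeadingSiblingPairer()
    parser.feed(html)
    parser.close()
    return [{key: value.strip() for key, value in row.items()} for row in parser.rows]


rows = pair_headings(SAMPLE_HTML)
for row in rows:
    print(row)
```

The point is the same as in the sitemap: the heading acts as the anchor for each row, and the address is reached from that heading instead of being collected as a separate list.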

Learn more: Top 5 CSS Selectors You Need to Know.

simon_clarke · December 20, 2022, 9:51pm

Thank you – the CSS selector tutorial looks very useful.