How to extract values from elements that appear one or more times?

Good day. I'd like some advice on how to scrape several elements that sometimes appear once and sometimes appear several times.

For example, site1 (my sitemap is below) shows only one store and its information. I've set up selectors for name (h1), address1, address2 and address3. Site2 shows 3 stores, and for each store I'd like to scrape name, address1, address2 and address3.

The main site has several cities, and some cities have one store and some have several stores.

1-) How do I handle both cases in the same sitemap?
2-) How do I make the name and addresses appear in the same row, in different columns, in the output file?

Thanks in advance.

Url1 for city1 (one store): site1
Url2 for city2 (multiple stores): site2

Sitemap for city1:
{"_id":"store","startUrl":["https://stores.aldi.us/ks/hutchinson/1711-n-waldron-st"],"selectors":[{"id":"name","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"address1","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"[itemprop='address'] span.Address-line1","type":"SelectorText"},{"id":"address2","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"[itemprop='address'] span.Address-city","type":"SelectorText"},{"id":"address3","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"span[itemprop='addressRegion']","type":"SelectorText"},{"id":"address4","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"span[itemprop='postalCode']","type":"SelectorText"}]}

Hi,

You can add a link selector to the sitemap that follows the link to each store on pages with multiple locations, and make the other selectors children of both the root and the link selector. This way the sitemap will work in both scenarios.

{"_id":"store","startUrl":["https://stores.aldi.us/ks/hutchinson/1711-n-waldron-st","https://stores.aldi.us/ks/overland-park"],"selectors":[{"id":"store-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":".Directory-content a.Teaser-titleLink","type":"SelectorLink"},{"id":"name","multiple":false,"parentSelectors":["_root","store-link"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"address1","multiple":false,"parentSelectors":["_root","store-link"],"regex":"","selector":"[itemprop='address'] span.Address-line1","type":"SelectorText"},{"id":"address2","multiple":false,"parentSelectors":["_root","store-link"],"regex":"","selector":"[itemprop='address'] span.Address-city","type":"SelectorText"},{"id":"address3","multiple":false,"parentSelectors":["_root","store-link"],"regex":"","selector":"span[itemprop='addressRegion']","type":"SelectorText"},{"id":"address4","multiple":false,"parentSelectors":["_root","store-link"],"regex":"","selector":"span[itemprop='postalCode']","type":"SelectorText"}]}
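For anyone who wants to apply the same change to an existing sitemap, here is a minimal Python sketch of the transformation described above. The helper name `add_link_parent` is made up for this illustration and is not part of Web Scraper; the sitemap is shortened to two text selectors to keep it readable.

```python
import copy
import json


def add_link_parent(sitemap, link_selector):
    """Return a copy of `sitemap` in which every existing selector is also
    a child of `link_selector` (hypothetical helper, not a Web Scraper API)."""
    result = copy.deepcopy(sitemap)
    link_id = link_selector["id"]
    for sel in result["selectors"]:
        # Keep "_root" so single-store pages still work, and add the link
        # selector so pages opened through a store link work as well.
        if link_id not in sel["parentSelectors"]:
            sel["parentSelectors"].append(link_id)
    result["selectors"].insert(0, copy.deepcopy(link_selector))
    return result


# Shortened version of the single-store sitemap from the question.
single_store = {
    "_id": "store",
    "startUrl": ["https://stores.aldi.us/ks/hutchinson/1711-n-waldron-st"],
    "selectors": [
        {"id": "name", "multiple": False, "parentSelectors": ["_root"],
         "regex": "", "selector": "h1", "type": "SelectorText"},
        {"id": "address1", "multiple": False, "parentSelectors": ["_root"],
         "regex": "", "selector": "[itemprop='address'] span.Address-line1",
         "type": "SelectorText"},
    ],
}

# Link selector taken from the sitemap above.
store_link = {
    "id": "store-link",
    "linkType": "linkFromHref",
    "multiple": True,
    "parentSelectors": ["_root"],
    "selector": ".Directory-content a.Teaser-titleLink",
    "type": "SelectorLink",
}

combined = add_link_parent(single_store, store_link)
combined["startUrl"].append("https://stores.aldi.us/ks/overland-park")

# Compact JSON that can be pasted into the "Import sitemap" dialog.
print(json.dumps(combined, separators=(",", ":")))
```

The original dictionary is left untouched, so the single-store sitemap can be kept for comparison.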

Thank you, thank you. It works. I see that in the case with 3 stores, the scraper opens each one because each store has its own link. I've seen some websites that show the information for all 3 stores on a single URL, without a link to each store.

One last question about this, if you have time.

Still using URL2, which has 3 stores: how can I retrieve the data for each store directly, without opening them one by one, and get the same output, that is, each store in a separate row with the name and addresses in separate columns?

Hi, yes, that can be set up. In that case, a wrapper selector (store) has to be created:

{"_id":"store3","startUrl":["https://stores.aldi.us/ks/hutchinson/1711-n-waldron-st","https://stores.aldi.us/ks/overland-park"],"selectors":[{"id":"store","multiple":true,"parentSelectors":["_root"],"selector":".Directory-content .Teaser-wrapper","type":"SelectorElement"},{"id":"name","multiple":false,"parentSelectors":["_root","store"],"regex":"","selector":".Teaser-info h3, h1","type":"SelectorText"},{"id":"address1","multiple":false,"parentSelectors":["_root","store"],"regex":"","selector":".Teaser-info span.Address-line1, [itemprop='address'] span.Address-line1","type":"SelectorText"},{"id":"address2","multiple":false,"parentSelectors":["_root","store"],"regex":"","selector":".Teaser-info span.Address-city, [itemprop='address'] span.Address-city","type":"SelectorText"},{"id":"address3","multiple":false,"parentSelectors":["_root","store"],"regex":"","selector":".Teaser-info span.Address-region, span[itemprop='addressRegion']","type":"SelectorText"},{"id":"address4","multiple":false,"parentSelectors":["_root","store"],"regex":"","selector":".Teaser-info span.Address-postalCode, span[itemprop='postalCode']","type":"SelectorText"}]}
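The pattern in this sitemap is that each text selector holds a comma-separated CSS selector list: one part for the teaser on the multi-store page and one for the single-store page. A selector list matches elements that satisfy any of its parts, so the same column can be filled on either layout. Below is a rough Python sketch that builds the sitemap from that pairing; the `FIELDS` table and the `text_selector` helper are names made up for this illustration.

```python
import json

# (column id, CSS on the multi-store listing, CSS on the single-store page)
# The selector strings are the ones used in the sitemap above.
FIELDS = [
    ("name", ".Teaser-info h3", "h1"),
    ("address1", ".Teaser-info span.Address-line1",
     "[itemprop='address'] span.Address-line1"),
    ("address2", ".Teaser-info span.Address-city",
     "[itemprop='address'] span.Address-city"),
    ("address3", ".Teaser-info span.Address-region",
     "span[itemprop='addressRegion']"),
    ("address4", ".Teaser-info span.Address-postalCode",
     "span[itemprop='postalCode']"),
]


def text_selector(field_id, listing_css, detail_css, wrapper_id):
    """Build one text selector that covers both page layouts
    (hypothetical helper, not a Web Scraper API)."""
    return {
        "id": field_id,
        "multiple": False,
        # Child of the root (single-store page) and of the wrapper (listing).
        "parentSelectors": ["_root", wrapper_id],
        "regex": "",
        # Comma-separated selector list: either part may match.
        "selector": f"{listing_css}, {detail_css}",
        "type": "SelectorText",
    }


# Element (wrapper) selector: one match per store teaser on the listing page.
wrapper = {
    "id": "store",
    "multiple": True,
    "parentSelectors": ["_root"],
    "selector": ".Directory-content .Teaser-wrapper",
    "type": "SelectorElement",
}

sitemap = {
    "_id": "store3",
    "startUrl": [
        "https://stores.aldi.us/ks/hutchinson/1711-n-waldron-st",
        "https://stores.aldi.us/ks/overland-park",
    ],
    "selectors": [wrapper] + [
        text_selector(fid, listing, detail, wrapper["id"])
        for fid, listing, detail in FIELDS
    ],
}

print(json.dumps(sitemap, separators=(",", ":")))
```

Keeping the two selectors for each column side by side in one table makes it easier to adjust the sitemap if either page layout changes.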