If...else clause or any other solution?

I am trying to scrape a website that has two different kinds of layouts.
The most common layout looks like https://www.chiquito.co.uk/restaurants/london/croydon/croydon and there we just have to put an element selector around section.Nap.
But sometimes we get a layout like https://www.chiquito.co.uk/restaurants/london/london and there section.Nap no longer exists, so we have to work on Teaser instead.

Url: https://www.chiquito.co.uk/restaurants

I tried working only on Teaser for every store and then deduplicating the results, but I only get 77 stores out of 85.
Does anybody know a clean solution to get all 85 stores, please?

Sitemap:
{"startUrl":"https://www.chiquito.co.uk/restaurants/","selectors":[{"parentSelectors":["_root"],"type":"SelectorLink","multiple":true,"id":"first-href","selector":"a.Directory-listLink","delay":""},{"parentSelectors":["first-href"],"type":"SelectorLink","multiple":true,"id":"second-href","selector":"a.Directory-listLink","delay":""},{"parentSelectors":["second-href"],"type":"SelectorElement","multiple":true,"id":"element","selector":"article.Teaser","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"name","selector":"a.Teaser-titleLink","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"address","selector":"div.Teaser-address div.c-AddressRow:nth-of-type(1)","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorText","id":"city","selector":"span.c-address-city","delay":"","multiple":false,"regex":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"zip_code","selector":"span.c-address-postal-code","regex":"(GIR|[A-Z]\d[A-Z\d]??|[A-Z]{2}\d[A-Z\d]??)[ ]??(\d[A-Z]{2})","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"address2","selector":"span.c-address-street-2","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorElementAttribute","multiple":false,"id":"cid","selector":"a.c-get-directions-button","extractAttribute":"href","delay":""},{"parentSelectors":["element2"],"type":"SelectorText","multiple":false,"id":"address_element2","selector":"span.c-address-street-1","regex":"","delay":""},{"parentSelectors":["element2"],"type":"SelectorText","multiple":false,"id":"city_element2","selector":"span.c-address-city","regex":"","delay":""}],"_id":"chiquito_gbr"}

Thanks in advance,
Nicolas.

Hello all,
Bump, please? :slight_smile:

Could you use a combined selector like this? (separated by comma)

section.Nap, section.Directory.Directory--alpha.LocationList article.Teaser

The above joins two selectors, .Nap and .Teaser. (Or is this a "union"?)

However, you need to make the .Teaser part more specific, so that it does not select, for example, the "Nearby Chiquito" entries underneath Croydon. Hence the more specific selector "section.Directory.Directory--alpha.LocationList article.Teaser".
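
As a rough sketch of where that would go, here is the "element" selector from the posted sitemap with only the "selector" value swapped for the combined one. Every other key is copied unchanged from the original, and whether the class chain matches the live pages is an assumption on my part:

```json
{
  "parentSelectors": ["second-href"],
  "type": "SelectorElement",
  "multiple": true,
  "id": "element",
  "selector": "section.Nap, section.Directory.Directory--alpha.LocationList article.Teaser",
  "delay": ""
}
```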

(I haven't tested it. Your sitemap has been somehow reformatted to become "invalid.")

Thanks for your feedback, I have just edited my sitemap. This one should work :slight_smile:

{"startUrl":"https://www.chiquito.co.uk/restaurants/","selectors":[{"parentSelectors":["_root"],"type":"SelectorLink","multiple":true,"id":"first-href","selector":"a.Directory-listLink","delay":""},{"parentSelectors":["first-href"],"type":"SelectorLink","multiple":true,"id":"second-href","selector":"a.Directory-listLink","delay":""},{"parentSelectors":["second-href"],"type":"SelectorElement","multiple":true,"id":"element","selector":"article.Teaser","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"name","selector":"a.Teaser-titleLink","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"address","selector":"div.Teaser-address div.c-AddressRow:nth-of-type(1)","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorText","id":"city","selector":"span.c-address-city","delay":"","multiple":false,"regex":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"zip_code","selector":"span.c-address-postal-code","regex":"(GIR|[A-Z]\d[A-Z\d]??|[A-Z]{2}\d[A-Z\d]??)[ ]??(\d[A-Z]{2})","delay":""},{"parentSelectors":["element"],"type":"SelectorText","multiple":false,"id":"address2","selector":"span.c-address-street-2","regex":"","delay":""},{"parentSelectors":["element"],"type":"SelectorElementAttribute","multiple":false,"id":"cid","selector":"a.c-get-directions-button","extractAttribute":"href","delay":""},{"parentSelectors":["element2"],"type":"SelectorText","multiple":false,"id":"address_element2","selector":"span.c-address-street-1","regex":"","delay":""},{"parentSelectors":["element2"],"type":"SelectorText","multiple":false,"id":"city_element2","selector":"span.c-address-city","regex":"","delay":""}],"_id":"chiquito_gbr"}

Jason, just to let you know, for future use of multiple selectors within one field:

  1. If the Multiple option is checked, WebScraper will pick both of them.
  2. If Multiple is not checked, WebScraper will pick only the first available element. That means that if both elements are on the page, it will pick only the first. If the first is not present on the page at all, it will pick the second.

Thanks. Just to confirm: in my suggested selector, where two CSS selectors are joined by a comma, I should use #1, with the Multiple option checked?

If you have 2 elements on the page and want to pick only one, you should have the Multiple option disabled.
Where .Nap is not present, .Teaser will be scraped instead.


P.S. sitemap is invalid because of regex.
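
The likely culprit is the unescaped backslashes: JSON does not accept `\d` inside a string, so each backslash has to be doubled before the sitemap can be imported. A sketch of the zip_code selector with the regex escaped, everything else copied from the sitemap above (not tested against the site):

```json
{
  "parentSelectors": ["element"],
  "type": "SelectorText",
  "multiple": false,
  "id": "zip_code",
  "selector": "span.c-address-postal-code",
  "regex": "(GIR|[A-Z]\\d[A-Z\\d]??|[A-Z]{2}\\d[A-Z\\d]??)[ ]??(\\d[A-Z]{2})",
  "delay": ""
}
```

Leaving "regex" empty is another way to get a sitemap that imports.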


@jasond @iconoclast could you provide a sitemap so that we can understand what you are talking about, please?
You can leave out the address, zip code, etc.; the problem is not there :slight_smile:

Thanks in advance,
Nicolas.

Bump, guys. An if...else solution would be a huge feature for the scraper :slight_smile:

I don't see how you use your multiple selector. Please post your sitemap so that I can understand what you want to do.

I have a similar issue with http://www.marvimundoperfumerias.com/perfumerias/result.php?t=1

You can make your element selector more precise by listing more of the ancestor elements before the element that you want to select (like a path). You can separate two different selectors with a comma; you just have to make sure that they do not select the same thing. Here is the improved sitemap:

{"_id":"chiquito_gbr","startUrl":["https://www.chiquito.co.uk/restaurants/"],"selectors":[{"id":"first-href","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.Directory-listLink","multiple":true,"delay":""},{"id":"second-href","type":"SelectorLink","parentSelectors":["first-href"],"selector":"a.Directory-listLink","multiple":true,"delay":""},{"id":"element","type":"SelectorElement","parentSelectors":["second-href"],"selector":"ul.Directory-listTeasers article.Teaser, section.Nap","multiple":true,"delay":""},{"id":"name","type":"SelectorText","parentSelectors":["element"],"selector":"h1.Heading, a.Teaser-titleLink","multiple":false,"regex":"","delay":""},{"id":"address","type":"SelectorText","parentSelectors":["element"],"selector":"span.c-address-street-1, div.Teaser-address div.c-AddressRow:nth-of-type(1)","multiple":false,"regex":"","delay":""},{"id":"city","type":"SelectorText","parentSelectors":["element"],"selector":"span.c-address-city","multiple":false,"regex":"","delay":""},{"id":"zip_code","type":"SelectorText","parentSelectors":["element"],"selector":"span.c-address-postal-code","multiple":false,"regex":"","delay":""},{"id":"address2","type":"SelectorText","parentSelectors":["element"],"selector":"span.c-address-street-2","multiple":false,"regex":"","delay":""},{"id":"cid","type":"SelectorElementAttribute","parentSelectors":["element"],"selector":"a.c-get-directions-button","multiple":false,"extractAttribute":"href","delay":""},{"id":"address_element2","type":"SelectorText","parentSelectors":["element2"],"selector":"span.c-address-street-1","multiple":false,"regex":"","delay":""},{"id":"city_element2","type":"SelectorText","parentSelectors":["element2"],"selector":"span.c-address-city","multiple":false,"regex":"","delay":""}]}

You still might want to make 2 separate element selectors to get cid for "Nap".
