(SOLVED) Shifted data due to inconsistent site structure

Hi,

My extracted data is sometimes shifted because the values are not always in the same place on the site I'm working with.

I've found that this is due to the movie's release date, which isn't stored the same way on the site depending on whether it's a first release or a re-release.

When it's a re-release, two dates are shown instead of one, and this shifts all the subsequent extracted values (columns) :confused:

The release date selector I defined is a.date, but when the movie has a re-release date, the selector apparently becomes span.date instead, and I have no idea how to deal with this.

I'm also adding screenshots showing the issue through the element preview on my movie_director field, which is wrongly populated with the release date in the second screenshot.

Thanks for your help!

Note that I have included two URLs in my sitemap to make the issue easier to see.

URLs:
http://www.allocine.fr/film/fichefilm_gen_cfilm=215094.html (no re-release date)
http://www.allocine.fr/film/fichefilm_gen_cfilm=1956.html (contains re-release date)

Sitemap:
{"_id":"allocine_detail_tmp","startUrl":["http://www.allocine.fr/film/fichefilm_gen_cfilm=1956.html","http://www.allocine.fr/film/fichefilm_gen_cfilm=215094.html"],"selectors":[{"id":"movie_title","type":"SelectorText","parentSelectors":["_root"],"selector":"div.titlebar-title-lg","multiple":false,"regex":"","delay":0},{"id":"movie-date","type":"SelectorText","parentSelectors":["_root"],"selector":"a.date","multiple":false,"regex":"","delay":0},{"id":"movie_duration","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(1)","multiple":false,"regex":"(?<=\\().+n","delay":0},{"id":"movie_director","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"movie_genre","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"movie_country","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"movie_prod_year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.item:nth-of-type(1) span.that","multiple":false,"regex":"","delay":0},{"id":"movie_dist","type":"SelectorText","parentSelectors":["_root"],"selector":"div.item:nth-of-type(1) a","multiple":false,"regex":"","delay":0},{"id":"movie_gross","type":"SelectorText","parentSelectors":["_root"],"selector":".more-hidden a.blue-link","multiple":false,"regex":"","delay":0},{"id":"movie_title_vo","type":"SelectorText","parentSelectors":["_root"],"selector":"h2.that","multiple":false,"regex":"","delay":0},{"id":"movie_cast","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(3)","multiple":false,"regex":"","delay":0}]}

You can combine two or more selectors by separating them with a comma, so:

a.date,span.date

If one of them is not found, WS (Web Scraper) will just scrape from the other. If neither is found, WS will return "null".

Ref: https://www.w3schools.com/cssref/css_selectors.asp
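As a rough sketch of the change, assuming the sitemap shape from the first post (trimmed here to the date selector only), the edit amounts to replacing the selector string of the movie-date entry. The snippet below is just an illustration in plain JavaScript; in practice you would make the same change in the selector edit form or in the exported sitemap JSON.

```javascript
// Trimmed copy of the sitemap from the first post: only the date selector is kept.
const sitemap = {
  _id: "allocine_detail_tmp",
  selectors: [
    {
      id: "movie-date",
      type: "SelectorText",
      parentSelectors: ["_root"],
      selector: "a.date",
      multiple: false,
      regex: "",
      delay: 0,
    },
  ],
};

// Look up the date selector by id and switch it to the comma-combined form.
const dateSelector = sitemap.selectors.find((s) => s.id === "movie-date");
dateSelector.selector = "a.date,span.date";

// Print the updated entry as JSON, ready to paste back into the sitemap.
console.log(JSON.stringify(dateSelector));
```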

OK, that fixes the release date, but unfortunately the subsequent columns are still shifted (like movie_director, as shown in the two screenshots in my first post).

The same goes for the movie_genre, movie_cast and movie_country columns.

This is again due to selectors that change: for movie_director, for example, it can be div.meta-body-item:nth-of-type(2) or div.meta-body-item:nth-of-type(3), depending on whether the movie is an initial release or a re-release.

So I tried defining both selectors separated by a comma (as you explained), but then I got stuck trying to define a regex that catches only the movie director value (which is always preceded by 'De ').

For some reason, this regex works on regexr.com but returns null in WS: (?<=De\s).+

It fails only when the movie is a re-release, for instance with the re-release movie link that was already in the sitemap in my first post. When the movie is an initial release, the director appears on the first line of both selectors combined (not on the second line), and it works correctly.

I guess it has something to do with handling multiple lines properly. I tried playing with \n but couldn't figure it out so far.

Also, I wonder whether there is a better solution than capturing the values through regex, as it might be risky if my boundary keywords appear elsewhere?

Thanks again for your help and time on this, @leemeng.

Well, I think I finally got it working!

I guess I was wrong to try applying my regex for the movie_director field to both selectors div.meta-body-item:nth-of-type(2),div.meta-body-item:nth-of-type(3), because I came to understand that (as you actually wrote) it catches one OR the other, which is why my regex could never operate successfully on both lines (at least, that's my guess).

So I instead used the selector .entity-card div.meta-body, which catches all the lines I was interested in.

I could then apply the regex (?<=De\s+)([\s\S]+)(?=Avec), which captures the director perfectly.

And I could apply the same logic to all my other fields that were failing :slightly_smiling_face:
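To illustrate my understanding, here is a small JavaScript sketch. The page text is made up (names, dates and line layout are only assumptions about what the meta-body items look like on a re-release page), and extract is a hypothetical helper, not something WS provides. It shows the regex finding nothing when only the re-release date line reaches it, and finding the director when the whole block does.

```javascript
// Hypothetical text of the meta-body items on a re-release page.
// The layout, names and dates are invented for illustration only.
const items = [
  "Date de sortie 1 janvier 1990 (1h 55min)",
  "Date de reprise 1 juin 2015",
  "De\nJane Doe",
  "Avec\nJohn Roe, Ann Poe",
  "Genres\nDrame",
];

// Hypothetical helper: return the whole match, trimmed, or null when nothing matches.
function extract(text, regex) {
  const match = text.match(regex);
  return match ? match[0].trim() : null;
}

// Case 1: only one item reaches the regex (here the re-release date line),
// so there is no "De " to anchor on.
const singleItem = items[1];
const fromSingleItem = extract(singleItem, /(?<=De\s).+/);

// Case 2: the whole block reaches the regex, so the "De ... Avec" span is present.
const wholeBlock = items.join("\n");
const fromWholeBlock = extract(wholeBlock, /(?<=De\s+)([\s\S]+)(?=Avec)/);

console.log(fromSingleItem); // null
console.log(fromWholeBlock); // Jane Doe
```

One thing to keep in mind: [\s\S]+ is greedy, so if 'Avec' appeared a second time further down in the block, the match would run up to the last occurrence. Also, when the regex is written by hand into the sitemap JSON, each backslash has to be doubled (\\s instead of \s).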

Feel free to comment on whether my solution seems OK to you too, @leemeng (or if you can think of a better one), and many thanks again for your help, as you definitely put me on the right path.

For Director, (?<=De\s+)(.|\n)+(?=Avec) would also work, though your regex should be fine. I would caution against overusing regex if plain CSS can do the job. For instance, you can use :contains to pick out divs based on their text content, e.g.:

div.meta-body > div.meta-body-item:contains('Date de sortie')
div.meta-body > div.meta-body-item:contains('Avec')
div.meta-body > div.meta-body-item:contains('Genres')

These would not be affected by changes in the div positions.
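As a sketch of how these could be wired into the sitemap, the JavaScript below builds one SelectorText entry per field from a label. The field ids come from the sitemap in the first post; the labelByField mapping is my own assumption about which label goes with which field, so adjust it to what the page actually shows.

```javascript
// Assumed mapping from sitemap field id to the label shown on the page.
const labelByField = {
  "movie-date": "Date de sortie",
  movie_cast: "Avec",
  movie_genre: "Genres",
};

// Build one SelectorText entry per field, mirroring the shape used in the sitemap.
const selectors = Object.entries(labelByField).map(([id, label]) => ({
  id,
  type: "SelectorText",
  parentSelectors: ["_root"],
  selector: `div.meta-body > div.meta-body-item:contains('${label}')`,
  multiple: false,
  regex: "",
  delay: 0,
}));

// Print the entries as JSON so they can be merged into the sitemap's "selectors" array.
console.log(JSON.stringify(selectors, null, 2));
```

Note that :contains is a jQuery-style extension rather than standard CSS, and the scraped text will still include the label itself (e.g. 'Avec'), so a short regex may still be needed to strip it.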

Many thanks again for your great advice; I didn't know about the contains CSS selector option. It's indeed certainly safer to use this than to search through too many lines (with too broad a selector): the larger the string my regex searches within, the more likely I am to match one of my patterns somewhere I don't want to. Besides, I guess searching smaller text strings might also improve performance a bit...