(SOLVED) Extract data outside of an element

Hi,

First of all, Happy New Year to all!

I have an issue with the page linked below, which lists all the movies released in France in December 2019.

I am already able to extract all the movie info I want with the sitemap below.

Now I would also like to extract the release date for each movie, but it is not available on each row of the page.
It is shown only once (in an h2 tag) for all the movies released on each Wednesday of the month.

For example:
"Films sortis en salles la semaine du 4 décembre 2019" means movies released on December 4th, 2019
"Films sortis en salles la semaine du 11 décembre 2019" means movies released on December 11th, 2019
etc.

So how should I add a selector that extracts and repeats the correct release date for each movie?

Thanks a lot for your help!

Url: http://www.allocine.fr/film/agenda/mois/mois-2019-12/

Sitemap:
{"_id":"allocine","startUrl":["http://www.allocine.fr/film/agenda/mois/mois-2019-12/"],"selectors":[{"id":"movie_list","type":"SelectorElement","parentSelectors":["_root"],"selector":"li.month-movie-item","multiple":true,"delay":0},{"id":"movie_title","type":"SelectorText","parentSelectors":["movie_list"],"selector":"a","multiple":false,"regex":"","delay":0},{"id":"movie_director","type":"SelectorText","parentSelectors":["movie_list"],"selector":"span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"movie_cast","type":"SelectorText","parentSelectors":["movie_list"],"selector":"span:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"movie_link","type":"SelectorElementAttribute","parentSelectors":["movie_list"],"selector":"a","multiple":false,"extractAttribute":"href","delay":0}]}

Happy new year, and welcome.

You can restructure your scraper a bit to create wrappers (grouping) for movies on each date:

{"_id":"allocine-test","startUrl":["http://www.allocine.fr/film/agenda/mois/mois-2019-12/"],"selectors":[{"id":"movie-agenda-month wrappers","type":"SelectorElement","parentSelectors":["_root"],"selector":"div > div.movie-agenda-month.hred","multiple":true,"delay":0},{"id":"Release date","type":"SelectorText","parentSelectors":["movie-agenda-month wrappers"],"selector":"h2","multiple":false,"regex":"","delay":0},{"id":"movie_list","type":"SelectorElement","parentSelectors":["movie-agenda-month wrappers"],"selector":"li.month-movie-item","multiple":true,"delay":0},{"id":"movie_title","type":"SelectorText","parentSelectors":["movie_list"],"selector":"a","multiple":false,"regex":"","delay":0},{"id":"movie_director","type":"SelectorText","parentSelectors":["movie_list"],"selector":"span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"movie_cast","type":"SelectorText","parentSelectors":["movie_list"],"selector":"span:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"movie_link","type":"SelectorElementAttribute","parentSelectors":["movie_list"],"selector":"a","multiple":false,"extractAttribute":"href","delay":0}]}
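
To picture why the wrapper helps, here is a minimal Python sketch of the same idea outside the scraper. The markup is a simplified, hypothetical stand-in for the real page (only the class names and the h2 texts come from this thread), and `scrape` is an illustrative helper, not part of any tool. Each wrapper is visited once, its h2 is read once, and that text is then copied onto every movie row found inside the same wrapper:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one wrapper div per release week.
PAGE = """
<div>
  <div class="movie-agenda-month hred">
    <h2>Films sortis en salles la semaine du 4 décembre 2019</h2>
    <ul>
      <li class="month-movie-item"><a href="/film/a">Film A</a></li>
      <li class="month-movie-item"><a href="/film/b">Film B</a></li>
    </ul>
  </div>
  <div class="movie-agenda-month hred">
    <h2>Films sortis en salles la semaine du 11 décembre 2019</h2>
    <ul>
      <li class="month-movie-item"><a href="/film/c">Film C</a></li>
    </ul>
  </div>
</div>
"""


def scrape(page):
    """Return one row per movie, repeating the wrapper's h2 on each row."""
    rows = []
    root = ET.fromstring(page.strip())
    for wrapper in root.findall("./div"):      # outer "wrappers" selector
        release = wrapper.find("h2").text      # read once per wrapper
        for item in wrapper.iter("li"):        # inner "movie_list" selector
            link = item.find("a")
            rows.append({
                "Release date": release,
                "movie_title": link.text,
                "movie_link": link.get("href"),
            })
    return rows


rows = scrape(PAGE)
for row in rows:
    print(row)
```

The nesting of the two loops mirrors the parentSelectors chain in the sitemap: movie_list now hangs under the wrapper instead of _root, so every movie row shares its wrapper's Release date.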


Thanks a lot leemeng, your fix works perfectly :+1:

I have just noticed another issue with this sitemap (it was actually already there before your fix).

Some movies unfortunately have no director defined, and this apparently messes up the extraction for the affected rows (the cast then ends up in the director column instead).

See, for instance, the movies 'Friends 25 : Celui qui fête son anniversaire' or 'Anne Roumanoff dans tout va bien'.

Can you think of any trick to fix this too, e.g. by filling in the word null when this selector is empty? (WS already does this automatically when there is no cast defined, by the way.)

Thanks again for your most valuable help!

That's a bit harder, 'cos the elements have no unique attributes to select on. Try this version, which uses regex to extract the text after "De" and "Avec". If there is no "De" and/or "Avec", it will return "null".

{"_id":"allocine-test3","startUrl":["http://www.allocine.fr/film/agenda/mois/mois-2019-12/"],"selectors":[{"id":"movie-agenda-month wrappers","type":"SelectorElement","parentSelectors":["_root"],"selector":"div > div.movie-agenda-month.hred","multiple":true,"delay":0},{"id":"Release date","type":"SelectorText","parentSelectors":["movie-agenda-month wrappers"],"selector":"h2","multiple":false,"regex":"","delay":0},{"id":"movie_list","type":"SelectorElement","parentSelectors":["movie-agenda-month wrappers"],"selector":"li.month-movie-item","multiple":true,"delay":0},{"id":"movie_title","type":"SelectorText","parentSelectors":["movie_list"],"selector":"a","multiple":false,"regex":"","delay":0},{"id":"movie_director","type":"SelectorText","parentSelectors":["movie_list"],"selector":"span:nth-of-type(1)","multiple":false,"regex":"(?<=De\\s+)(.|\\n)+","delay":0},{"id":"movie_cast","type":"SelectorText","parentSelectors":["movie_list"],"selector":"_parent_","multiple":false,"regex":"(?<=Avec\\s+)(.|\\n)+","delay":0},{"id":"movie_link","type":"SelectorElementAttribute","parentSelectors":["movie_list"],"selector":"a","multiple":false,"extractAttribute":"href","delay":0}]}
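
As a rough illustration of what the two regexes do, here is a small Python sketch. It assumes the scraper applies the regex to the text the selector returns and reports null when nothing matches. The sample strings and the helper name `after_keyword` are made up for the example. Python's `re` does not accept the variable-width lookbehind used in the sitemap (`(?<=De\s+)`), so the sketch swaps in a capture group that approximates it:

```python
import re


def after_keyword(text, keyword):
    """Return the text following `keyword` plus whitespace, or "null"."""
    # Capture-group stand-in for the sitemap's (?<=keyword\s+)(.|\n)+
    match = re.search(re.escape(keyword) + r"\s+((?:.|\n)+)", text)
    return match.group(1).strip() if match else "null"


# Hypothetical row WITH a director: the first span holds "De ...".
first_span = "De Jane Doe"
whole_item = "Film A De Jane Doe Avec John Roe, Ann Poe"
director = after_keyword(first_span, "De")
cast = after_keyword(whole_item, "Avec")

# Hypothetical row WITHOUT a director: the first span is now the cast.
first_span_no_dir = "Avec John Roe"
whole_item_no_dir = "Film B Avec John Roe"
director_missing = after_keyword(first_span_no_dir, "De")
cast_no_dir = after_keyword(whole_item_no_dir, "Avec")

print(director, "|", cast)
print(director_missing, "|", cast_no_dir)
```

When the director is missing, the first span starts with "Avec", so the "De" regex finds nothing and the field falls back to null instead of swallowing the cast. Pointing movie_cast at _parent_ means the "Avec" regex searches the whole list item's text, so it no longer matters which span the cast ends up in.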


Man, you're a true King :clap:

I get the idea of checking for a missing 'De' and/or 'Avec' to test whether the director is defined. What I don't really get is how adding both regexes and changing the movie_cast selector to _parent_ results in null being displayed when movie_director is empty. But it definitely works as expected!

If you could explain a bit more about how this works, that would be very kind. Thanks again anyway for your expertise!