Hi,
My extracted data is sometimes shifted because the values are not always in the same place on the site I'm working with.
I've found that this is due to the movie's release date, which isn't stored the same way on the site depending on whether it's a first release or a re-release.
When it's a re-release, there are 2 dates showing instead of 1, and this shifts all the following values (columns) that I extract.
The release date selector I defined is `a.date`, but when a movie has a re-release date, the selector apparently becomes `span.date` instead. I have no idea how to deal with this case.
I'm also adding screenshots showing the issue in the element preview of my movie_director field, which is wrongly populated with the release date in the 2nd screenshot.
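To make the shift concrete, here is a rough Python sketch of what I think is happening. It is only an illustration: the item texts, the English labels, and the helper names `by_position` and `by_label` are all made up for the example, and I'm assuming the re-release date shows up as one extra div.meta-body-item on the page.

```python
# Hypothetical, simplified text of the div.meta-body-item blocks on each page.
# Labels and values are placeholders, not copied from the site.
FIRST_RELEASE = [
    "Release date: <date> (<duration>)",
    "Director: <director>",
    "Cast: <cast>",
    "Genre: <genre>",
    "Country: <country>",
]

# Assumption: a re-release adds one extra item right after the first date.
RE_RELEASE = [
    "Release date: <date> (<duration>)",
    "Re-release date: <date>",
    "Director: <director>",
    "Cast: <cast>",
    "Genre: <genre>",
    "Country: <country>",
]


def by_position(items, n):
    """Mimic div.meta-body-item:nth-of-type(n), which counts from 1."""
    return items[n - 1]


def by_label(items, label):
    """Return the value of the item starting with the label, wherever it sits."""
    prefix = label + ":"
    for text in items:
        if text.startswith(prefix):
            return text[len(prefix):].strip()
    return None


if __name__ == "__main__":
    # Position 2 is the director on one page and a date on the other.
    print(by_position(FIRST_RELEASE, 2))
    print(by_position(RE_RELEASE, 2))
    # Looking the item up by its label is not affected by the extra row.
    print(by_label(FIRST_RELEASE, "Director"))
    print(by_label(RE_RELEASE, "Director"))
```

So a purely positional selector like nth-of-type(2) picks a different field depending on the page, while a lookup keyed on the item's label would not. I don't know whether or how that second approach can be expressed in a sitemap.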
Thanks for your help!
Note that I have included 2 URLs in my sitemap to make the issue easier to see.
URLs:
http://www.allocine.fr/film/fichefilm_gen_cfilm=215094.html (no re-release date)
http://www.allocine.fr/film/fichefilm_gen_cfilm=1956.html (contains re-release date)
Sitemap:
{"_id":"allocine_detail_tmp","startUrl":["http://www.allocine.fr/film/fichefilm_gen_cfilm=1956.html","http://www.allocine.fr/film/fichefilm_gen_cfilm=215094.html"],"selectors":[{"id":"movie_title","type":"SelectorText","parentSelectors":["_root"],"selector":"div.titlebar-title-lg","multiple":false,"regex":"","delay":0},{"id":"movie-date","type":"SelectorText","parentSelectors":["_root"],"selector":"a.date","multiple":false,"regex":"","delay":0},{"id":"movie_duration","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(1)","multiple":false,"regex":"(?<=\().+n","delay":0},{"id":"movie_director","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"movie_genre","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"movie_country","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"movie_prod_year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.item:nth-of-type(1) span.that","multiple":false,"regex":"","delay":0},{"id":"movie_dist","type":"SelectorText","parentSelectors":["_root"],"selector":"div.item:nth-of-type(1) a","multiple":false,"regex":"","delay":0},{"id":"movie_gross","type":"SelectorText","parentSelectors":["_root"],"selector":".more-hidden a.blue-link","multiple":false,"regex":"","delay":0},{"id":"movie_title_vo","type":"SelectorText","parentSelectors":["_root"],"selector":"h2.that","multiple":false,"regex":"","delay":0},{"id":"movie_cast","type":"SelectorText","parentSelectors":["_root"],"selector":"div.meta-body-item:nth-of-type(3)","multiple":false,"regex":"","delay":0}]}