Need a bit of help with scraping and formatting the output correctly

Hi,

I've successfully scraped the webpage and the child pages, but the release dates aren't in the same row as the release they belong to. How can I fix this so they end up in the same row?


Thanks!

URL: Nieuw op Videoland in week 1 2025: nieuwe films en series - Nieuw Deze Week

Sitemap:
{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/page/[1-2]"],"selectors":[{"id":"media-cards","parentSelectors":["_root"],"type":"SelectorLink","selector":"a.movie__title","multiple":true,"linkType":"linkFromHref"},{"id":"videoland-link","parentSelectors":["media-cards"],"type":"SelectorLink","selector":"a.btn-md:nth-of-type(1)","multiple":false,"linkType":"linkFromHref"},{"id":"ReleaseDates","parentSelectors":["_root"],"type":"SelectorText","selector":"a.btn-sm","multiple":true,"regex":""}]}

It is easy to fix in Excel...

Hi, you have to create a wrapper element and place the data selectors inside the wrapper:

{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/page/[1-2]"],"selectors":[{"id":"media-cards","linkType":"linkFromHref","multiple":false,"parentSelectors":["wrapper"],"selector":"a.movie__title","type":"SelectorLink"},{"id":"videoland-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["media-cards"],"selector":"a.btn-md:nth-of-type(1)","type":"SelectorLink"},{"id":"ReleaseDates","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"a.btn-sm","type":"SelectorText"},{"id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":".movie","type":"SelectorElement"}]}

Thanks, that worked.

One more question:
I have the title set as a link selector so the scraper can follow it to the subpages. Is there an option to hide the title's link in the output for a sitemap like this?

{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/"],"selectors":[{"id":"Titels","linkType":"linkFromHref","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"selector":"a.movie__title","type":"SelectorLink"},{"id":"ReleaseDate","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"a.btn","type":"SelectorText"},{"id":"wrapper_for_Titels_ReleaseDates","multiple":true,"parentSelectors":["_root"],"selector":"div.movie:nth-of-type(n+3)","type":"SelectorElement"},{"id":"Genre","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"p.movie__option","type":"SelectorText"},{"id":"Speeltijd","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"p.movie__time","type":"SelectorText"},{"extractAttribute":"href","id":"Link Videolad","multiple":false,"parentSelectors":["Titels"],"selector":"a.btn-md:nth-of-type(1)","type":"SelectorElementAttribute"}]}

Why not just remove/hide the column in the output file?

Yes, I can do that, but I thought that if there is a simple option to hide it while scraping, that would be better.
If it's not simple, I will just hide it in Excel.

If the scraper has to open the link, then it is not possible to avoid the href column.
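If Excel is not convenient, the column can also be dropped from the exported CSV afterwards. This is only a minimal sketch: it assumes the export names the link column "Titels-href" (selector id plus "-href"), and the sample rows are made up for illustration.

```python
import csv
import io


def drop_columns(src, dst, unwanted):
    """Copy CSV rows from src to dst, leaving out the columns named in unwanted."""
    reader = csv.DictReader(src)
    kept = [name for name in (reader.fieldnames or []) if name not in unwanted]
    # extrasaction="ignore" lets us pass the full row and skip the dropped keys.
    writer = csv.DictWriter(
        dst, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Made-up sample rows standing in for an export file (assumed column names).
sample = io.StringIO(
    "Titels,Titels-href,ReleaseDate\n"
    "Example Title,https://example.com/a,1 januari\n"
)
cleaned = io.StringIO()
drop_columns(sample, cleaned, {"Titels-href"})
print(cleaned.getvalue())
```

With real files, open() handles for the input and output CSV can be passed in place of the StringIO objects.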

Thanks.

I've noticed that the scraper is skipping movie subpages when scraping.
Every page lists 12 movies/series. When I monitor the scraper, it goes well for the first few pages, but after that it randomly skips subpages. You can see in the output that some pages have fewer than 12 entries.

I tried setting the request interval and the page load delay to 3000, but that didn't help.

The scraper visits every unique URL only once. Some series are listed several times with different episodes but with the same URL, so their subpage is scraped only for the first occurrence and the later listings are skipped.
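To check whether repeated URLs really account for the missing rows, the links collected from the listing pages can be counted. A rough sketch, with made-up example.com URLs standing in for the scraped links:

```python
from collections import Counter


def repeated_urls(links):
    """Return {url: count} for every URL that appears more than once."""
    counts = Counter(links)
    return {url: n for url, n in counts.items() if n > 1}


# Made-up listing: two episodes of one series share a subpage URL.
listing = [
    "https://example.com/series-a",
    "https://example.com/film-b",
    "https://example.com/series-a",
]
print(repeated_urls(listing))
print(len(listing), "listed rows,", len(set(listing)), "unique URLs")
```

If the number of unique URLs matches the number of subpage rows in the output, the skipped rows are repeats rather than failed page loads.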

I really need them to be included in the list. Is there an option to include them, even if they are already scraped once?

You can do it in two steps. First, collect ALL the links on each listing page, duplicates included. Second, gather the information from each movie's own page. Then combine the second data set with the first in Excel by matching on the URL. Easy :)
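The same combination can be done outside Excel. The sketch below is an assumption-laden illustration, not a Web Scraper feature: it assumes the first export has the link in a "Titels-href" column, that the second export records each movie page's address in a "web-scraper-start-url" column, and all rows are invented.

```python
import csv
import io


def join_on_url(listing_rows, detail_rows, listing_key, detail_key):
    """Attach the detail fields to every listing row that has the same URL."""
    # One detail record per URL; each subpage was scraped only once.
    details = {row[detail_key]: row for row in detail_rows}
    joined = []
    for row in listing_rows:
        merged = dict(row)
        for name, value in details.get(row[listing_key], {}).items():
            if name != detail_key:
                merged[name] = value
        joined.append(merged)
    return joined


# Step 1 export (made up): every listing row, repeated URLs included.
step1 = io.StringIO(
    "Titels,Titels-href,ReleaseDate\n"
    "Series A ep 1,https://example.com/series-a,1 januari\n"
    "Series A ep 2,https://example.com/series-a,2 januari\n"
)
# Step 2 export (made up): one row per unique movie page.
step2 = io.StringIO(
    "web-scraper-start-url,Link Videolad\n"
    "https://example.com/series-a,https://example.com/watch/series-a\n"
)
rows = join_on_url(
    list(csv.DictReader(step1)),
    list(csv.DictReader(step2)),
    "Titels-href",
    "web-scraper-start-url",
)
for row in rows:
    print(row)
```

Because the join starts from the listing rows, both episodes keep their own row and each receives the details scraped once from the shared page.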