Need a bit of help with scraping and formatting the output correctly

BRBNDR · January 2, 2025, 10:00am

Hi,

I've successfully scraped the webpage and the child pages, but the release dates aren't in the same row as the release they belong to. How can I fix this so they end up in the same row?

Thanks!

Url: Nieuw op Videoland in week 1 2025: nieuwe films en series - Nieuw Deze Week

Sitemap:
{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/page/[1-2]"],"selectors":[{"id":"media-cards","parentSelectors":["_root"],"type":"SelectorLink","selector":"a.movie__title","multiple":true,"linkType":"linkFromHref"},{"id":"videoland-link","parentSelectors":["media-cards"],"type":"SelectorLink","selector":"a.btn-md:nth-of-type(1)","multiple":false,"linkType":"linkFromHref"},{"id":"ReleaseDates","parentSelectors":["_root"],"type":"SelectorText","selector":"a.btn-sm","multiple":true,"regex":""}]}

don2010 · January 2, 2025, 12:10pm

It is easy to fix in Excel...

JanAp · January 2, 2025, 12:15pm

Hi, you have to create a wrapper element and place the data selectors inside the wrapper:

{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/page/[1-2]"],"selectors":[{"id":"media-cards","linkType":"linkFromHref","multiple":false,"parentSelectors":["wrapper"],"selector":"a.movie__title","type":"SelectorLink"},{"id":"videoland-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["media-cards"],"selector":"a.btn-md:nth-of-type(1)","type":"SelectorLink"},{"id":"ReleaseDates","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"a.btn-sm","type":"SelectorText"},{"id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":".movie","type":"SelectorElement"}]}
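To see why the wrapper helps, here is a minimal Python sketch. It is a simplified model of the behaviour described in this thread, not the extension's actual code, and the `cards` data and both function names are made up for illustration. It assumes that two independent `multiple` selectors under `_root` each emit their own rows, while a single `multiple` element selector emits one row per card.

```python
# Hypothetical stand-in for the listing page: each dict is one ".movie" card.
cards = [
    {"title": "Film A", "release": "1 Jan"},
    {"title": "Series B", "release": "3 Jan"},
]


def rows_without_wrapper(cards):
    """Model two independent 'multiple' selectors: each emits its own rows."""
    rows = [{"media-cards": c["title"], "ReleaseDates": None} for c in cards]
    rows += [{"media-cards": None, "ReleaseDates": c["release"]} for c in cards]
    return rows


def rows_with_wrapper(cards):
    """Model one 'multiple' wrapper: child selectors are read once per card."""
    return [
        {"media-cards": c["title"], "ReleaseDates": c["release"]} for c in cards
    ]


for row in rows_with_wrapper(cards):
    print(row)
```

In this model the version without a wrapper yields four half-empty rows for two cards, while the wrapper version yields two complete rows.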

BRBNDR · January 2, 2025, 12:27pm

Thanks, that worked.

One more question:
I have the title set as a link so the scraper can follow the subpages. Is there an option to hide the title's link in the output? This is my current sitemap:

{"_id":"NieuwDezeWeek-Videoland","startUrl":["https://nieuwdezeweek.nl/videoland/"],"selectors":[{"id":"Titels","linkType":"linkFromHref","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"selector":"a.movie__title","type":"SelectorLink"},{"id":"ReleaseDate","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"a.btn","type":"SelectorText"},{"id":"wrapper_for_Titels_ReleaseDates","multiple":true,"parentSelectors":["_root"],"selector":"div.movie:nth-of-type(n+3)","type":"SelectorElement"},{"id":"Genre","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"p.movie__option","type":"SelectorText"},{"id":"Speeltijd","multiple":false,"parentSelectors":["wrapper_for_Titels_ReleaseDates"],"regex":"","selector":"p.movie__time","type":"SelectorText"},{"extractAttribute":"href","id":"Link Videolad","multiple":false,"parentSelectors":["Titels"],"selector":"a.btn-md:nth-of-type(1)","type":"SelectorElementAttribute"}]}

JanAp · January 2, 2025, 12:44pm

Why not just remove/hide the column in the output file?

BRBNDR · January 2, 2025, 12:47pm

Yes, I can do that, but I thought that if there is a simple option to hide it while scraping, that would be better.
If it's not simple, I will just hide it in Excel.

JanAp · January 2, 2025, 12:53pm

If it is required to open the link, then it is not possible to avoid the href column.
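If Excel is not convenient, the column can also be dropped from the exported CSV afterwards. The following is a rough sketch using only the Python standard library. The function name `drop_columns` is hypothetical, and the column name `Titels-href` is an assumption about how the link column is labelled in the export, so check the header of your own file first.

```python
import csv
import io


def drop_columns(csv_text, unwanted):
    """Return csv_text with the columns named in `unwanted` removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in unwanted]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns we are not keeping.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up sample export; "Titels-href" is an assumed column name.
sample = "Titels,Titels-href,ReleaseDate\nFilm A,https://example.com/a,1 Jan\n"
print(drop_columns(sample, {"Titels-href"}))
```

For a real export, read the file's contents into `csv_text` and write the returned string to a new file.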

BRBNDR · January 2, 2025, 1:06pm

Thanks.

I've noticed that the scraper is skipping movie subpages while scraping.
Every page lists 12 movies/series. When I monitor the scraper, it goes well for the first few pages, but after that it randomly skips subpages. You can see in the output that some pages have fewer than 12 entries.

I tried setting the request interval and page load delay to 3000, but that didn't help.

JanAp · January 2, 2025, 1:12pm

The scraper visits every unique URL only once. Some series have several episodes listed that all point to the same URL, so that subpage is scraped only for the first occurrence and the later occurrences are skipped.
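The effect can be modelled in a few lines of Python. This is only an illustration of the "visit each unique URL once" rule described above, not the scraper's real implementation. The function name `crawl_order` and the example URLs are invented.

```python
def crawl_order(links):
    """Return the URLs that would actually be opened, skipping repeats."""
    visited = set()
    opened = []
    for url in links:
        if url in visited:
            continue  # same URL seen before: this listing entry is skipped
        visited.add(url)
        opened.append(url)
    return opened


# Made-up listing: two episodes of one series share a single subpage URL.
listing = [
    "https://example.com/series-b",
    "https://example.com/film-a",
    "https://example.com/series-b",
]
print(crawl_order(listing))
```

Three listing entries lead to only two opened subpages here, which matches the symptom of pages ending up with fewer than 12 rows.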

BRBNDR · January 2, 2025, 1:15pm

I really need them to be included in the list. Is there an option to include them, even if they have already been scraped once?

don2010 · January 2, 2025, 1:19pm

You can do it in two steps. First, collect ALL links on each page. Second, gather all the information from each movie's page. In Excel you can then combine this data with the result of the first step. Easy :)
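The combining step can also be done outside Excel. Below is a minimal Python sketch of that lookup, assuming both scrapes have been loaded as lists of dicts and share a URL column. The function name `merge_listing_with_details`, the key name `url` and the sample rows are all hypothetical.

```python
def merge_listing_with_details(listing_rows, detail_rows, key="url"):
    """Attach subpage details to every listing row, duplicates included."""
    details_by_url = {row[key]: row for row in detail_rows}
    merged = []
    for row in listing_rows:
        extra = details_by_url.get(row[key], {})
        # Listing values win if both sides happen to share a column name.
        merged.append({**extra, **row})
    return merged


# Step 1 (made-up data): every listing entry, repeated URLs included.
listing_rows = [
    {"title": "Series B ep. 1", "url": "https://example.com/series-b"},
    {"title": "Series B ep. 2", "url": "https://example.com/series-b"},
]
# Step 2 (made-up data): one row per unique subpage.
detail_rows = [
    {"url": "https://example.com/series-b", "videoland_link": "https://example.com/watch-b"},
]

for row in merge_listing_with_details(listing_rows, detail_rows):
    print(row)
```

Because the lookup is keyed on the URL, both episodes get the same subpage details even though that subpage was scraped only once.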