Scraping a website, but the export contains only 5 pages

Hello, I am scraping a website, but the export contains only 5 pages.
I found that if I delete the venue_address column, it exports xx pages. Do you know why?

Url: https://www.ticketmaster.cz

Sitemap:
{"_id":"1Aticketmaster-all-events","startUrl":["https://www.ticketmaster.cz/"],"selectors":[{"id":"category_links","multiple":true,"parentSelectors":["_root"],"selector":"a.sc-qojxtn-0","type":"SelectorLink"},{"id":"event_links","multiple":true,"parentSelectors":["category_links","pagination"],"selector":"a.jdbotF","type":"SelectorLink"},{"id":"event_title","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"h1.sc-1eku3jf-14","type":"SelectorText"},{"id":"event_image","multiple":false,"parentSelectors":["event_links"],"selector":"img","type":"SelectorImage"},{"id":"event_date","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-1eku3jf-16 [data-testid] span","type":"SelectorText"},{"clickElementSelector":".sc-1eku3jf-15 span","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":3000,"discardInitialElements":"do-not-discard","id":"click_more_info","multiple":true,"parentSelectors":["event_links"],"selector":".sc-1eku3jf-15 span","type":"SelectorElementClick"},{"id":"sub_event_title","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"h3","type":"SelectorText"},{"id":"sub_event_date","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-l54kkq-1 span","type":"SelectorHTML"},{"id":"sub_event_address","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-l54kkq-0:nth-of-type(2) span.sc-l54kkq-1","type":"SelectorText"},{"id":"sub_event_info","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-1vj56w2-2","type":"SelectorText"},{"id":"sub_event_organizer","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-1vj56w2-0:nth-of-type(3) p","type":"SelectorText"},{"id":"event_category","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-1eku3jf-17 li:nth-of-type(2)","type":"SelectorText"},{"clickElementSelector":"a.sc-htoDjs","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":5000,"discardInitialElements":"do-not-discard","id":"pagination","multiple":true,"parentSelectors":["category_links","pagination"],"selector":"a.sc-htoDjs","type":"SelectorElementClick"}]}

Thank you

I still haven't been able to figure it out, unfortunately :confused:
I need the exact address, and I haven't found another way to get it.
Can anyone help, please?

Could you give me some more advice? I don't know what to do.
I still don't know how to get the addresses. The address has to be reached through a link, since it only appears on the next page.
E.g. event page: https://www.ticketmaster.cz/event/fil-bo-riva-support-hoehn-vstupenky/50647 and address page: https://www.ticketmaster.cz/venue/bike-jesus-praha-7-vstupenky/bikejesus/105

Hi,

Did you check this topic?

Hi,
Yes, I've read that, but I couldn't make sense of it. I don't know how to apply it to the link.
For example:
Could you help me modify the sitemap posted here so that the address is extracted from the page https://www.ticketmaster.cz/venue/bike-jesus-praha-7-vstupenky/bikejesus/105?

A click on the address link will not be viable for the same reasons explained in the other post.

The address can be found in the script element, but it will require post-processing to extract it from the script.

{"_id":"1Aticketmaster-all-events","startUrl":["https://www.ticketmaster.cz/"],"selectors":[{"id":"category_links","multiple":true,"parentSelectors":["_root"],"selector":"a.sc-qojxtn-0","type":"SelectorLink"},{"id":"event_links","multiple":true,"parentSelectors":["category_links","pagination"],"selector":"a.jdbotF","type":"SelectorLink"},{"id":"event_title","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"h1.sc-1eku3jf-14","type":"SelectorText"},{"id":"event_image","multiple":false,"parentSelectors":["event_links"],"selector":"img","type":"SelectorImage"},{"id":"event_date","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-1eku3jf-16 [data-testid] span","type":"SelectorText"},{"clickElementSelector":".sc-1eku3jf-15 span","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":3000,"discardInitialElements":"do-not-discard","id":"click_more_info","multiple":true,"parentSelectors":["event_links"],"selector":".sc-1eku3jf-15 span","type":"SelectorElementClick"},{"id":"sub_event_title","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"h3","type":"SelectorText"},{"id":"sub_event_date","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-l54kkq-1 span","type":"SelectorHTML"},{"id":"sub_event_address","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-l54kkq-0:nth-of-type(2) span.sc-l54kkq-1","type":"SelectorText"},{"id":"sub_event_info","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-1vj56w2-2","type":"SelectorText"},{"id":"sub_event_organizer","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"div.sc-1vj56w2-0:nth-of-type(3) p","type":"SelectorText"},{"id":"event_category","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":".sc-1eku3jf-17 li:nth-of-type(2)","type":"SelectorText"},{"clickElementSelector":"a.sc-htoDjs","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":5000,"discardInitialElements":"do-not-discard","id":"pagination","multiple":true,"parentSelectors":["category_links","pagination"],"selector":"a.sc-htoDjs","type":"SelectorElementClick"},{"id":"script","multiple":false,"parentSelectors":["event_links"],"regex":"","selector":"script[type=\"application/ld+json\"]:contains('Address')","type":"SelectorText"}]}
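
The post-processing step mentioned above could look roughly like the following Python sketch. It assumes the exported "script" column holds a JSON-LD string that follows the schema.org Event layout, with the address stored under an "address" key as either a plain string or a PostalAddress object. The sample string and the helper names (find_addresses, format_address, extract_address) are made up for illustration and are not taken from the Ticketmaster page.

```python
import json


def find_addresses(node):
    """Recursively collect every value stored under an "address" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "address":
                found.append(value)
            else:
                found.extend(find_addresses(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_addresses(item))
    return found


def format_address(address):
    """Turn a PostalAddress-style dict (or a plain string) into one line."""
    if isinstance(address, str):
        return address.strip()
    if not isinstance(address, dict):
        return ""
    # Assumed schema.org PostalAddress field names; missing fields are skipped.
    fields = ("streetAddress", "postalCode", "addressLocality", "addressCountry")
    parts = [str(address[name]).strip() for name in fields if address.get(name)]
    return ", ".join(parts)


def extract_address(script_text):
    """Return the first address found in a JSON-LD string, or "" if none."""
    try:
        data = json.loads(script_text)
    except (TypeError, ValueError):
        # Empty cells or non-JSON text end up here.
        return ""
    addresses = find_addresses(data)
    return format_address(addresses[0]) if addresses else ""


# Hypothetical sample of what one cell of the "script" column might contain.
sample = (
    '{"@type": "Event", "name": "Example", "location": {"@type": "Place", '
    '"name": "Example Hall", "address": {"@type": "PostalAddress", '
    '"streetAddress": "Example Street 1", "addressLocality": "Praha", '
    '"postalCode": "170 00"}}}'
)
print(extract_address(sample))
```

You would run extract_address over each row of the exported file. If the real JSON-LD uses different field names, the tuple in format_address needs adjusting.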

Great, thank you.
It looks like it works, but I don't know how to use script[type="application/ld+json"]:contains('Address') on other event websites, where the address is on a separate page that is linked from the event details.

Example:
Detail Štístko a Poupěnka - Pojď si hrát! | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport
Address [KD Šeříkovka] - KD Šeříkovka | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport

This solution will not work with ticketportal.cz; the address is not available in the HTML.

Oh. The address is listed separately for each ticket, see:
KD Šeříkovka KD Šeříkovka | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport
House of Culture Dům kultury | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport
Hvězda Cinema Kino Hvězda | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport
You would have to follow a link to the address page, which is where the address is. But when I created a link selector and put a text selector under it, it didn't work. I am currently dealing with this on multiple pages.

You can try a different approach and start by visiting the venue pages (from the sitemap.xml) and going into the event links from there:

{"_id":"ticketportal-cz","startUrl":["https://www.ticketportal.cz/"],"selectors":[{"id":"sitemap","parentSelectors":["_root"],"sitemapXmlMinimumPriority":"0.1","sitemapXmlUrlRegex":"venue/","sitemapXmlUrls":["https://www.ticketportal.cz/Home/Sitemap"],"type":"SelectorSitemapXmlLink"},{"id":"venue","multiple":false,"parentSelectors":["sitemap"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"address","multiple":false,"parentSelectors":["sitemap"],"regex":"","selector":"tr:contains('Address:') td:nth-of-type(2)","type":"SelectorText"},{"id":"event","linkType":"linkFromHref","multiple":true,"parentSelectors":["sitemap"],"selector":"a[itemprop='name']","type":"SelectorLink"},{"id":"event-name","multiple":false,"parentSelectors":["event"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"date","multiple":false,"parentSelectors":["event"],"regex":"","selector":"tr:contains('Date:') td:nth-of-type(2)","type":"SelectorText"}]}
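
To illustrate what the sitemapXmlUrlRegex and sitemapXmlMinimumPriority settings above amount to, here is a rough Python sketch that filters a sitemap.xml the same way. It is not Web Scraper's actual implementation. The sample XML, the example.com URLs, and the function name filter_sitemap_urls are made up, and treating a missing priority as 0.5 is an assumption based on the sitemaps.org default.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(xml_text, url_pattern, minimum_priority=0.0):
    """Return the <loc> URLs that match url_pattern and meet the priority."""
    root = ET.fromstring(xml_text)
    pattern = re.compile(url_pattern)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        raw = entry.findtext("sm:priority", default="", namespaces=SITEMAP_NS).strip()
        # Assumption: entries without a priority get the protocol default 0.5.
        priority = float(raw) if raw else 0.5
        if loc and pattern.search(loc) and priority >= minimum_priority:
            urls.append(loc)
    return urls


# Hypothetical sitemap content, for illustration only.
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/venue/example-hall</loc><priority>0.5</priority></url>
  <url><loc>https://example.com/event/example-show</loc><priority>0.8</priority></url>
  <url><loc>https://example.com/venue/old-hall</loc><priority>0.0</priority></url>
</urlset>"""

venue_urls = filter_sitemap_urls(sample_xml, r"venue/", minimum_priority=0.1)
print(venue_urls)
```

Only the venue URL with a high enough priority survives the filter, so the scraper starts on venue pages and reaches the events from there.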

I'll try, thank you very much :slight_smile:

When I ran the sitemap, the address was empty. It should always be extracted from the venue (address) page, e.g. Kino Hvězda | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport, which can be reached from the event detail, e.g. KD Šeříkovka | TICKETPORTAL Vstupenky na Dosah – divadlo, hudba, koncert, festival, muzikál, sport

There was a language issue: the address selector now also matches the Czech label 'Adresa:' in addition to 'Address:'. Try this sitemap:

{"_id":"ticketportal-cz","startUrl":["https://www.ticketportal.cz/"],"selectors":[{"id":"sitemap","parentSelectors":["_root"],"sitemapXmlMinimumPriority":"0.1","sitemapXmlUrlRegex":"venue/","sitemapXmlUrls":["https://www.ticketportal.cz/Home/Sitemap"],"type":"SelectorSitemapXmlLink"},{"id":"venue","multiple":false,"parentSelectors":["sitemap"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"address","multiple":false,"parentSelectors":["sitemap"],"regex":"","selector":"tr:contains('Address:') td:nth-of-type(2), tr:contains('Adresa:') td:nth-of-type(2)","type":"SelectorText"},{"id":"event","multiple":true,"parentSelectors":["sitemap"],"selector":".ticket-cover","type":"SelectorElement"},{"id":"event-name","multiple":false,"parentSelectors":["event"],"regex":"","selector":"a[itemprop='name']","type":"SelectorText"},{"id":"date","multiple":false,"parentSelectors":["event"],"regex":"","selector":".date","type":"SelectorText"}]}

Works great, but I just noticed that the output has no event image and does not say which category the event is in.

Please, is there any way to add an image and a category?

Any ideas?

Any idea how to get the event image?

Hi, you will not be able to open the event links and venue links within one sitemap because they cross-reference and get discarded due to the deduplication feature.

Try setting up a separate sitemap for the events and merge the output files by using a unique identifier.
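
The merge step could be done with a short script like the Python sketch below. It assumes both exports are CSV files that share one column with identical values, here the event name. The column names, sample rows, and the function name merge_on_key are made up for illustration. An event URL would usually be a safer identifier than a name if both sitemaps can export it.

```python
import csv
import io


def merge_on_key(left_rows, right_rows, key):
    """Left-join two lists of row dicts on a shared column."""
    lookup = {}
    for row in right_rows:
        # Keep the first row seen for each key value.
        lookup.setdefault(row[key], row)
    merged = []
    for row in left_rows:
        combined = dict(row)
        # Add the columns from the matching row without overwriting any.
        for column, value in lookup.get(row[key], {}).items():
            combined.setdefault(column, value)
        merged.append(combined)
    return merged


# Hypothetical exports: one from the venue sitemap, one from an event sitemap.
venues_csv = """event-name,venue,address
Example Show,Example Hall,"Example Street 1, Praha"
Other Show,Other Hall,"Other Street 2, Brno"
"""
events_csv = """event-name,image,category
Example Show,https://example.com/a.jpg,Theatre
"""

venue_rows = list(csv.DictReader(io.StringIO(venues_csv)))
event_rows = list(csv.DictReader(io.StringIO(events_csv)))

merged = merge_on_key(venue_rows, event_rows, "event-name")
for row in merged:
    print(row)
```

With real files you would replace the io.StringIO objects with open("venues.csv", newline="", encoding="utf-8") and the like. Rows without a match keep only their original columns, so fill in the missing ones before writing the result back out with csv.DictWriter.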