Scrape details page with pagination

Till_Uberfarbe · April 12, 2021, 1:16pm

Hi Community,
I'm trying to scrape the archive of a newspaper. To do that, I need to scrape the article name, then open the article and scrape its content. Since there are many articles, I need to move on to the next page of articles once the articles on the current page are scraped. Through this forum I was able to get this far (see below), but it just skips from page to page without scraping the articles. Any ideas on how to solve this?
Thanks so much in advance!

URL: Archiv – Politik Nachrichten – 2020 – Sueddeutsche.de

Sitemap:
{"_id":"sz2020neu","startUrl":["https://www.sueddeutsche.de/archiv/politik/2020"],"selectors":[{"id":"artikel","type":"SelectorElement","parentSelectors":["neueseite"],"selector":"em > ","multiple":true,"delay":0},{"id":"inhalt","type":"SelectorText","parentSelectors":["artikel"],"selector":"p.css-13wylk3","multiple":true,"regex":"","delay":0},{"id":"neueseite","type":"SelectorLink","parentSelectors":["_root","neueseite"],"selector":".arrow a","multiple":true,"delay":0}]}

ViestursWS · April 12, 2021, 2:44pm

Hello @Till_Uberfarbe

You can define a page range in the start URL, for example [1-100], then create an element selector for each of the articles, and then add a link selector that leads to the article's details page, where the content is scraped.
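
To illustrate what the range notation stands for, here is a minimal Python sketch that expands a `[1-100]` placeholder into one URL per page. It is not Web Scraper's own code: `expand_page_range` is a hypothetical helper, the range is assumed to be inclusive on both ends, and whether the archive really has 100 pages is not checked.

```python
import re


def expand_page_range(url_template: str) -> list[str]:
    """Expand a single [start-end] placeholder into one URL per page.

    Hypothetical helper for illustration only; assumes an inclusive range.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", url_template)
    if match is None:
        # No placeholder found: the template is already a plain URL.
        return [url_template]
    start, end = int(match.group(1)), int(match.group(2))
    prefix = url_template[:match.start()]
    suffix = url_template[match.end():]
    return [f"{prefix}{page}{suffix}" for page in range(start, end + 1)]


urls = expand_page_range(
    "https://www.sueddeutsche.de/archiv/politik/2020/page/[1-100]"
)
print(len(urls))   # number of archive pages covered by the range
print(urls[0])     # first page URL
print(urls[-1])    # last page URL
```

With a range in the start URL, every archive page becomes its own starting point, so the sitemap no longer needs a "next page" link selector.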

How to make page range definition: Web Scraper pagination tutorial - YouTube (example starts at 0:35)
Documentation: Installation | Web Scraper Documentation
How to's: Web Scraper << How to >> video tutorials

Something like this:

{"_id":"sz2020neu","startUrl":["https://www.sueddeutsche.de/archiv/politik/2020/page/[1-100]"],"selectors":[{"id":"article","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.entrylist__entry","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["article"],"selector":"em.entrylist__title","multiple":false,"regex":"","delay":0},{"id":"link","type":"SelectorLink","parentSelectors":["article"],"selector":"a","multiple":false,"delay":0},{"id":"content","type":"SelectorText","parentSelectors":["element-card"],"selector":"article.lp_is_start","multiple":false,"regex":"","delay":0},{"id":"element-card","type":"SelectorElement","parentSelectors":["link"],"selector":"body:has(article#readspeaker-content)","multiple":true,"delay":0},{"id":"foto","type":"SelectorImage","parentSelectors":["element-card"],"selector":"[data-hydration-component-name=\"ImageAsset\"] img","multiple":false,"delay":0}]}
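
The nesting in the sitemap above is easier to see as a tree. The following Python sketch uses a trimmed copy of the selectors (only `id` and `parentSelectors` are kept) and prints the parent/child hierarchy. The helper names `build_children` and `render_tree` are made up for this illustration, and the recursion assumes the sitemap has no selector that is its own ancestor.

```python
from collections import defaultdict

# Trimmed copy of the sitemap above: only "id" and "parentSelectors" are kept.
selectors = [
    {"id": "article", "parentSelectors": ["_root"]},
    {"id": "title", "parentSelectors": ["article"]},
    {"id": "link", "parentSelectors": ["article"]},
    {"id": "content", "parentSelectors": ["element-card"]},
    {"id": "element-card", "parentSelectors": ["link"]},
    {"id": "foto", "parentSelectors": ["element-card"]},
]


def build_children(selectors):
    """Map each parent id to the ids of its child selectors."""
    children = defaultdict(list)
    for sel in selectors:
        for parent in sel["parentSelectors"]:
            children[parent].append(sel["id"])
    return children


def render_tree(children, node="_root", depth=0):
    """Return the hierarchy as indented lines (assumes no cycles)."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(render_tree(children, child, depth + 1))
    return lines


tree_lines = render_tree(build_children(selectors))
print("\n".join(tree_lines))
```

The printed tree shows `title` and `link` under `article`, and `content` and `foto` under `element-card`, which itself sits under `link`. Child selectors of a link selector run on the page the link opens, so the article text is read from the details page rather than from the archive listing.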

Hope it helps.

Till_Uberfarbe · April 12, 2021, 4:00pm

Thanks so much, this actually works!