Every paragraph a new row?

Hey Community,

I'm trying to scrape a German newspaper archive. Everything works fine, but when I get to the actual content (the article text), I can only scrape the first paragraph. If I select "multiple" for the text selector, each paragraph is listed in a new row. Is there a fix for this, so I can scrape whole articles into one row?
Thanks in advance!

This is my sitemap:
{"_id":"sz012020","startUrl":["https://www.sueddeutsche.de/archiv/politik/2020/01/page/[1-100]"],"selectors":[{"id":"article","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.entrylist__entry","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["article"],"selector":"em.entrylist__title","multiple":false,"regex":"","delay":0},{"id":"link","type":"SelectorLink","parentSelectors":["article"],"selector":"a","multiple":false,"delay":0},{"id":"head","type":"SelectorText","parentSelectors":["link"],"selector":"span.css-1r9juou","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["link"],"selector":"time","multiple":false,"regex":"","delay":0},{"id":"content","type":"SelectorText","parentSelectors":["link"],"selector":"p.css-13wylk3","multiple":true,"regex":"","delay":0}]}

Hi @Till_Uberfarbe

For the content selector, I would use a Text selector with div[data-testid="article-body"] as the selector and the "multiple" option unchecked.

Hey Viesturs, I did that but then it only scrapes the first paragraph instead of all of them. Any more ideas?

Hi @Till_Uberfarbe

Then you can keep the previous selector but change its type from "Text" to "Grouped". Afterward, you can post-process the data with the Web Scraper Cloud parser feature.

Helpful resources:

https://webscraper.io/documentation/selectors/grouped-selector
https://webscraper.io/documentation/web-scraper-cloud/parser

Hope it helps.
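
If you would rather post-process the export locally instead of with the Cloud parser, here is a minimal Python sketch. It assumes the Grouped selector stores its result as a JSON array of objects keyed by the selector id (here "content"); please check that against your own export before relying on it. The function name and the sample string are made up for illustration.

```python
import json


def join_grouped_paragraphs(cell: str, key: str = "content", sep: str = "\n\n") -> str:
    """Collapse one Grouped selector cell into a single article string.

    Assumption: `cell` looks like '[{"content": "..."}, {"content": "..."}]'.
    """
    # An empty cell means no paragraphs were matched on that page.
    items = json.loads(cell) if cell else []
    # Take the text of each paragraph, tolerating missing or null values.
    paragraphs = [
        (item.get(key) or "").strip()
        for item in items
        if isinstance(item, dict)
    ]
    # Drop empty paragraphs and join the rest into one text block.
    return sep.join(p for p in paragraphs if p)


# Hypothetical sample of what one "content" cell might contain.
sample = '[{"content": "First paragraph."}, {"content": " Second paragraph. "}]'
article = join_grouped_paragraphs(sample)
print(article)
```

You would call this once per exported row on the "content" column, so each article ends up as one string in one row.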

Sorry for the late reply. It worked beautifully. Thanks a lot!