Scraped text is too long with tables inside

alex9575 · March 2, 2022, 2:32am

Hello. I'm a newbie here but learning fast. I am trying to scrape press releases like this one:

I need all the text from the article, but not the tables in it. When I try to scrape the article body with the selector "div.bw-release-story" or "[itemprop='articleBody']", the scraped text is incomplete. My guess is that it's simply too long for an Excel cell because of all the data in the tables. Is that possible? I thought of two ways to work around the issue:

1. Exclude the tables from my selection. I tried using :not(:has(div[class="bw-release-table-js"])), but that excludes the whole text whenever it contains tables. Maybe I'm not using it correctly?
2. Force Web Scraper to scrape all the text even if it's very long, maybe by splitting the text across multiple rows. Is there a way to do that?

Thanks for your help.

ViestursWS · March 2, 2022, 4:44pm

@alex9575 Hi. In this case you could use the "Grouped" selector and list the desired classes/IDs separated by commas. It is also possible to select multiple unrelated elements by holding the Shift key while clicking them.

Example:

{
  "_id": "businesswire-com",
  "startUrl": [
    "https://www.businesswire.com/news/home/20220301005233/en/Kohls-Reports-Fourth-Quarter-and-Full-Year-Fiscal-2021-Financial-Results"
  ],
  "selectors": [
    {
      "delay": 0,
      "extractAttribute": "",
      "id": "txt",
      "parentSelectors": ["_root"],
      "selector": ".bw-release-subhead li, [itemprop='articleBody'] > p:nth-of-type(n+2), ul.bwlistdisc:nth-of-type(2) li",
      "type": "SelectorGroup"
    }
  ]
}
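
Side note on the two ideas from the question: if post-processing the page outside Web Scraper is an option, both can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the table wrapper class bw-release-table-js is taken from the question, the sample HTML is made up, and the markup is assumed to close every non-void tag explicitly. The 32,767 figure is Excel's per-cell character limit; whether that limit is what actually truncated the export is not confirmed here. The names TextWithoutTables, article_text and split_for_excel are made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

EXCEL_CELL_LIMIT = 32767  # maximum number of characters Excel keeps in one cell


class TextWithoutTables(HTMLParser):
    """Collect text, ignoring <table> elements and the table wrapper class.

    Assumes well-formed markup where every non-void tag is closed.
    """

    def __init__(self, skip_class="bw-release-table-js"):
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a skipped subtree
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" or self.skip_class in classes:
            self.skip_depth = 1  # start skipping at this element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def article_text(html):
    """Return the text of the HTML with table content left out."""
    parser = TextWithoutTables()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def split_for_excel(text, limit=EXCEL_CELL_LIMIT):
    """Cut text into consecutive chunks that each fit in one Excel cell."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


# Made-up sample that mimics the structure described in the question.
sample = (
    '<div class="bw-release-story">'
    '<p>First paragraph.</p>'
    '<div class="bw-release-table-js"><table><tr><td>123</td></tr></table></div>'
    '<p>Second paragraph.</p>'
    '</div>'
)

print(article_text(sample))
```

The parser skips the table wrapper itself instead of filtering on the article container, which is why the surrounding paragraphs survive. A selector like :not(:has(...)) applied to the container drops the whole container as soon as it has a table anywhere inside it.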