How do I select only text with no HTML, for the entire page?

rninja · September 14, 2021, 7:28pm

I am trying to grab all visible text from a page with a single selector. I haven't found a way to get the text without HTML in it. The only alternative I see is to set up a selector for each text element, but then the format gets busted and it becomes kind of unwieldy. I really only want one field with all visible text on the page...
Hoping someone will know how to do this.

ViestursWS · September 16, 2021, 3:00pm

@rninja

Hi, can you provide your sitemap or the targeted website?

rninja · November 19, 2021, 6:04pm

Hi @ViestursWS ! Let's take cnn.com as an example. I want to pull the body copy without HTML and have multiple instances saved to a CSV. I was using the HTML Type initially and then changed it to Text (while selecting "multiple" and ensuring that the selector attribute is set to the text tag type - P for example).

The question is: now that I do get text without HTML, is there a way to collect the text from multiple tags on the page as one cell's worth of data in the CSV or XLSX file? How would one collect those with a single selector?

leemeng · November 29, 2021, 12:09am

You can do it with the selector html or body (check the site's source to see which is used), but this would be a pretty messy way to get data.

Type: Text
Selector: html
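
For anyone who wants to see the idea outside of Web Scraper, here is a rough Python sketch of what "one field with all the page text" amounts to. It is not how Web Scraper itself works. The class name VisibleText, the helper page_text, the SAMPLE markup and the example URL are all made up for illustration, and it only uses the standard library. It skips script and style content, which is one reason a bare html selector can come out messy, then writes everything into a single CSV cell.

```python
import csv
import io
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping non-displayed tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []      # text fragments in document order
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so the result fits cleanly in one cell.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.parts.append(text)


def page_text(html):
    """Return all text outside the skipped tags as one space-joined string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up markup standing in for a real page.
SAMPLE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Headline</h1><p>First &amp; second.</p>"
    "<script>var x = 1;</script><p>Last   one.</p>"
    "</body></html>"
)

# One row per page: the (placeholder) URL and a single cell holding all text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url", "text"])
writer.writerow(["https://example.com/", page_text(SAMPLE)])
print(buffer.getvalue())
```

This does not check CSS visibility (hidden elements still count), so treat it as a starting point only.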