I am trying to grab all visible text from each page with a single selector. I haven't found a way to get the text without HTML in it. The only alternative I see is to set up a selector for each text element, but then the format gets busted and it becomes kind of unwieldy. I really only want one field with all visible text on the page...
Hoping someone will know how to do this.
Hi @ViestursWS! Let's take cnn.com as an example. I want to pull the body copy without HTML and have multiple instances saved to a CSV. I was using the HTML type initially and then changed it to Text (while selecting "multiple" and making sure the selector targets the text element's tag, P for example).
The question is, now that I do get text without HTML, is there a way to collect the text from multiple elements on the page as one cell's worth of data in the CSV or XLSX file? How would one collect those with a single selector?
You can do it with the selector html or body (check the site's source to see which is used), but this would be a pretty messy way to get the data.
Type: Text
Selector: html
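
If the single cell from html or body turns out too messy, one option is to clean the page text outside the scraper. Below is a minimal Python sketch of that idea, not something Web Scraper itself runs. It uses only the standard library, and the names VisibleTextParser, visible_text and to_csv_row are made up for this illustration. It assumes you already have the page's HTML as a string, and it only skips tags that never render (script, style and so on); it cannot tell whether an element is hidden by CSS.

```python
import csv
import io
from html.parser import HTMLParser

# Tags whose contents are not shown as page text (assumed list, extend as needed).
HIDDEN_TAGS = {"script", "style", "noscript", "template", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that sit outside the HIDDEN_TAGS elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # how many hidden tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return all collected text as one string with whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def to_csv_row(url, html):
    """Build one CSV line: the page URL plus a single cell holding all the text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([url, visible_text(html)])
    return buffer.getvalue()
```

Calling to_csv_row once per page and writing the results to a file gives one row per page with all the text in a single cell, which is the same shape the html or body selector produces, minus the script and style noise.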