Text selector returning text + HTML

FrankSmith · November 22, 2022, 4:55am

Web Scraper version: 0.6.5

Chrome version: Version 107.0.5304.107 (Official Build) (64-bit)

OS: Windows 11

Sitemap:

{"_id":"test-text","startUrl":["https://www.beatlesbible.com/1957/07/06/john-lennon-meets-paul-mccartney/"],"selectors":[{"id":"Content","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"div.thecontent","type":"SelectorText"}]}

Error Message: None

I don't know for sure that this is a bug, but it's not returning the result I expect. When scraping, the data returned includes two <img ....> tags in addition to the text. I expect it to return only the text. Is this a bug? Incorrectly formatted HTML? User error? Something else?

Thanks,
Frank

ViestursWS · November 22, 2022, 3:09pm

@FrankSmith Hi, it appears that the targeted block contains a 'noscript' tag that holds this data as plain text.

FrankSmith · November 23, 2022, 12:09am

@ViestursWS, thanks for your quick response. Is it generally expected that any text within a 'noscript' tag would be returned? As I understand it, this text would only be displayed in the browser if scripts are not supported. Clearly, the webscraper extension uses a browser that does support scripts.

ViestursWS · November 23, 2022, 4:15pm

@FrankSmith Accessing this page and this exact data from different browsers does not make a difference. The data within a 'noscript' tag is treated as regular text.

If you are looking to discard such paragraphs, you can use the following 'Grouped' selector - div.thecontent p:not(:has(noscript))

Example:

{"_id":"test-text","startUrl":["https://www.beatlesbible.com/1957/07/06/john-lennon-meets-paul-mccartney/"],"selectors":[{"extractAttribute":"","id":"Content","parentSelectors":["_root"],"selector":"div.thecontent p:not(:has(noscript))","type":"SelectorGroup"}]}
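For readers who want to see what that selector filters out, here is a rough sketch in Python using only the standard library. It is not how the Web Scraper extension works internally; the ParagraphFilter class and the sample HTML are made up for illustration, and for brevity it looks at every <p> rather than only those inside div.thecontent.

```python
from html.parser import HTMLParser


class ParagraphFilter(HTMLParser):
    """Collect the text of <p> elements, skipping any that contain <noscript>.

    Hypothetical helper: approximates p:not(:has(noscript)) on simple,
    non-nested paragraphs.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []         # text of the paragraphs that are kept
        self._parts = None           # text chunks of the current <p>, None when outside one
        self._has_noscript = False   # True once <noscript> is seen in the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
            self._has_noscript = False
        elif tag == "noscript" and self._parts is not None:
            self._has_noscript = True

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            if not self._has_noscript:
                self.paragraphs.append("".join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Made-up sample markup, not copied from the real page.
sample = (
    '<div class="thecontent">'
    "<p>First paragraph.</p>"
    '<p><noscript><img src="photo.jpg"></noscript>Caption text</p>'
    "<p>Second paragraph.</p>"
    "</div>"
)

parser = ParagraphFilter()
parser.feed(sample)
parser.close()
print(parser.paragraphs)
```

Note that the whole paragraph holding the 'noscript' tag is dropped, including any ordinary text next to it, which is the same trade-off the selector makes.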

FrankSmith · November 25, 2022, 9:23pm

@ViestursWS, thanks for the example of using the Grouped selector. Unfortunately, that approach would still require post-processing to get only the wanted text. It is a usable workaround, though.
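As a rough idea of the kind of post-processing meant here, the original Text selector output could be cleaned after export by removing the literal <img ...> strings. This is only a sketch: the strip_img_markup function is a made-up name, and the pattern assumes no attribute value contains a ">" character.

```python
import re

# Matches a literal <img ...> tag that ended up in the scraped text.
# Assumption: attribute values do not themselves contain ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_img_markup(text: str) -> str:
    """Remove literal <img ...> tags, then tidy the spacing left behind."""
    without_tags = IMG_TAG.sub("", text)
    # Collapse runs of spaces/tabs only, so paragraph line breaks survive.
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


# Made-up example string, not actual output from the page.
scraped = 'Before the photo <img src="a.jpg" alt="x"> after the photo'
print(strip_img_markup(scraped))
```

Run against the exported CSV column, this would leave the surrounding text intact and drop only the tag strings.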

My original question, I think, still applies. Should this be considered a bug with the webscraper extension?