Replace characters in text during scraping

CharlesLuck · October 4, 2021, 9:35pm

Hi,

I'm using the Firefox extension, and I'm hoping it will cover everything I need.
I wasn't sure how to categorize this, so it remains uncategorized for now.

I am interested in scraping a site, and I need a CSV file so I can import the scraped data into a website. We all know that a CSV file is comma-delimited, and if there are commas in the text, even when the text has double quotes around it, my spreadsheet splits the value and it continues in the next column of the spreadsheet. A manual cleanup defeats the whole purpose.

I ran into the same problem when in the middle of the description text there is a double quote [ " ] to mean inches. As you can imagine, in a product description if someone states [24" Monitor], it's not good.

I guess what I could use is some help figuring out how to replace internal commas and quotes with something else, perhaps a slash instead of a comma and the value [ in. ] instead of the double quote. Preferably this would happen before the data gets to the spreadsheet program that actually does the CSV-to-spreadsheet conversion.

Any ideas? I did search all over for a regex to handle it for me, but learning regex from scratch is going to be painful.

ViestursWS · October 5, 2021, 6:01am

@CharlesLuck Hi. Have you tried the 'Replace text' parser within Web Scraper Cloud?

https://webscraper.io/documentation/web-scraper-cloud/parser/replace-text

CharlesLuck · October 5, 2021, 3:02pm

Thanks for that Viesturs,

Based on my reading since posting this question, I came to realize that "Parser" is not a part of the equation when working with the free browser extension. Parser is for paying customers, and as with all "free" software products, there is a point at which you've been delivered to an "aha" moment.

That "aha" moment, invariably, is when you realize that you have to pay-up or shut-up. The "bait & switch" is the predominant modus operandi on the internet. That doesn't mean the free Parsehub isn't useful, but not if you run into clever websites, or poorly designed text editors that didn't take this into consideration. So in this case, on that website, I'm trying to scrape product descriptions that are full of internal commas that separate words, and double quotes used to mean inches instead of the actual abbreviation "in.". I don't have to tell you, this breaks the conversion of the CSV file into an OpenOffice spreadsheet in a way that is beyond my ability to correct: the contents of a description get fragmented across a number of columns, throwing off the entire spreadsheet. I was hoping for some kind of regex workaround on fields like that.
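
One possible workaround that doesn't need the paid parser is to clean the exported file with a short script before opening it in the spreadsheet. The following is only a minimal sketch in Python, assuming the export is a validly quoted CSV (fields with commas wrapped in double quotes, and inner quotes doubled). The function names `clean_field` and `clean_csv` are made up for illustration and are not part of Web Scraper.

```python
import csv
import io
import re


def clean_field(value: str) -> str:
    """Replace characters that tend to confuse a spreadsheet import."""
    # A double quote right after a digit is assumed to mean inches: 24" -> 24 in.
    value = re.sub(r'(\d)\s*"', r'\1 in.', value)
    # Any remaining double quotes are simply dropped (an assumption).
    value = value.replace('"', '')
    # Internal commas become slashes, with spacing normalized around them.
    value = re.sub(r'\s*,\s*', ' / ', value)
    return value


def clean_csv(text: str) -> str:
    """Parse CSV text with the standard csv module and rewrite every field."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([clean_field(cell) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    sample = 'name,description\nMonitor,"24"" Monitor, black, wide"\n'
    print(clean_csv(sample))
```

With the sample above, the description field comes out as `24 in. Monitor / black / wide`, so the cleaned file no longer contains commas or quotes inside any field. Note that the header row goes through the same cleanup, which is harmless unless the column names themselves contain commas or quotes.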

leemeng · October 6, 2021, 9:18am

If you have Web Scraper Dev, you can try the latest "export to .Xlsx". The Excel format seems to handle commas and quotes better.

Dev version can be obtained at: