WS inserts CR (\n) inside fields when exporting CSV

Jesus_Medina · February 16, 2019, 4:01pm

Web Scraper version: 0.3.8
Chrome version: Version 72.0.3626.109 (Official Build) (64-bit)
OS: OS X El Capitán 10.11.6 (15G22010)

Sitemap:
{"_id":"airlines_manager","startUrl":["https://www.airlines-manager.com/aircraft/buy/new"],"selectors":[{"id":"short-button","type":"SelectorLink","parentSelectors":["_root"],"selector":"div.middle a:nth-of-type(1)","multiple":false,"delay":0}]}
Error Message:

WS inserts CR (\n) in the middle of fields and destroys the CSV.

A comma-delimited CSV file can only have a CR at the end of each record, never in the middle of a field!!!

If I parse this HTML:

<a href="/aircraft/buy/new/0/short" class="">
                <img src="/images/icons/cc.png?v1.6.11" title="Short haul" alt="Short haul">
                <span class="short">SH</span>
                <span class="long">Short haul</span>
            </a>

The result in CSV is:

web-scraper-order,web-scraper-start-url,short-button,short-button-href
"1550332307-1827","https://www.airlines-manager.com/aircraft/buy/new","SH
                Short haul","https://www.airlines-manager.com/aircraft/buy/new/0/short"


You can see the <CR> after the text 'SH'. It is a nightmare, because I can't import the CSV into Excel.

Please solve it asap!!!!

Thanks


leemeng · August 25, 2019, 1:41pm

Hi, this is not really a bug, as the linefeed character is valid within CSV fields as long as the field is enclosed in double quotes (you can check the spec, RFC 4180). BTW, CR is carriage return, not linefeed (you would normally see it followed by a linefeed, CR/LF). In fact, you will see such CSV files if you export Google contacts and some fields (e.g. Notes) contain multiple lines (meaning they contain LFs). You can also open such CSV files in Excel, and you will see that Excel correctly handles fields with LFs (i.e. they are valid).
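
To illustrate the point about quoted fields, here is a minimal sketch using Python's standard csv module on the export from the first post. The variable names are just for illustration; the idea is only that a CSV-aware parser reads the two physical lines as one record.

```python
import csv
import io

# The export from the first post, reproduced as a string.
raw = (
    'web-scraper-order,web-scraper-start-url,short-button,short-button-href\n'
    '"1550332307-1827","https://www.airlines-manager.com/aircraft/buy/new","SH\n'
    '                Short haul","https://www.airlines-manager.com/aircraft/buy/new/0/short"\n'
)

# newline="" leaves line endings untouched so the csv module can
# handle the line break inside the quoted field itself.
rows = list(csv.reader(io.StringIO(raw, newline="")))

header, record = rows      # one header row plus one data record
print(len(record))         # number of fields in the record
print(repr(record[2]))     # the short-button field, LF included
```

If the embedded line break is unwanted downstream, `" ".join(record[2].split())` collapses it and the surrounding spaces into a single space.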

Also, I don't think WS is "inserting" any characters; most likely the element you are scraping contains LFs so WS will just scrape up those too.
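
A rough way to check this is to collect the text nodes of the HTML fragment from the first post. The sketch below uses Python's html.parser as a stand-in; it is an assumption that Web Scraper extracts text in a comparable way, and TextCollector is a made-up helper name.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node in a fragment, whitespace included."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# The fragment from the first post.
html = '''<a href="/aircraft/buy/new/0/short" class="">
                <img src="/images/icons/cc.png?v1.6.11" title="Short haul" alt="Short haul">
                <span class="short">SH</span>
                <span class="long">Short haul</span>
            </a>'''

collector = TextCollector()
collector.feed(html)
collector.close()  # flush any buffered text

text = "".join(collector.chunks)
trimmed = text.strip()               # line break between the spans survives
collapsed = " ".join(text.split())   # whitespace runs reduced to one space

print(repr(trimmed))
print(repr(collapsed))
```

The line break sits in the source between the two span elements, so it shows up in the scraped text unless whitespace is collapsed afterwards.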

A workaround is to insert an uncommon character (e.g. ¥, the yen symbol) at the start of each row to act as a marker, and use that as the row delimiter when parsing. The web-scraper-order number is quite distinct, so you can do a straightforward search-and-replace:
find: "155033
replace: ¥"155033
Then use ¥ as the row delimiter instead of LF.
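
The same search-and-replace can be sketched in Python. This assumes ¥ never occurs in the scraped data and that no field other than web-scraper-order starts with "155033. The second record is hypothetical, added only to show more than one row being split.

```python
MARKER = "¥"  # assumed not to appear anywhere in the scraped data

raw = (
    'web-scraper-order,web-scraper-start-url,short-button,short-button-href\n'
    '"1550332307-1827","https://www.airlines-manager.com/aircraft/buy/new","SH\n'
    '                Short haul","https://www.airlines-manager.com/aircraft/buy/new/0/short"\n'
    # Hypothetical second record, for illustration only.
    '"1550332307-1828","https://www.airlines-manager.com/aircraft/buy/new","LH\n'
    '                Long haul","https://www.airlines-manager.com/aircraft/buy/new/0/long"\n'
)

# find: "155033   replace: ¥"155033
marked = raw.replace('"155033', MARKER + '"155033')

# Split on the marker instead of LF, so multi-line fields stay in one piece.
header, *records = marked.split(MARKER)
header = header.rstrip("\n")
records = [r.rstrip("\n") for r in records]

print(len(records))
for r in records:
    print(repr(r))
```

Each entry in records is one complete row with its embedded LF intact; the header row has no marker, so it comes out as the first piece of the split.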