Exported CSV file is broken

drspock · July 4, 2018, 7:19am

Web Scraper version: 0.3.7
Chrome version: 67.0.3396.99
OS: Windows 10

Sitemap:

{"_id":"electronicloisirs","startUrl":["https://electronicloisirs.com/"],"selectors":[{"id":"categorie","type":"SelectorLink","selector":"ul.sf-menu > li.col-lg-2 > a.sf-with-ul","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"produit","type":"SelectorLink","selector":"div.right-block a.product-name","parentSelectors":["pagination"],"multiple":true,"delay":0},{"id":"titre","type":"SelectorText","selector":"h1","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"Reference","type":"SelectorText","selector":"p#product_reference span.editable","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"prix","type":"SelectorText","selector":"p.our_price_display span.price","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"ensavoirplus","type":"SelectorText","selector":"div.rte","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"image","type":"SelectorImage","selector":"span img","parentSelectors":["produit"],"multiple":false,"delay":0},{"id":"pagination","type":"SelectorLink","selector":"div.bottom-pagination-content li:nth-of-type(n+2) a","parentSelectors":["categorie","pagination"],"multiple":true,"delay":0},{"id":"sous_cat_1","type":"SelectorText","selector":"span:nth-of-type(3) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_2","type":"SelectorText","selector":"span:nth-of-type(5) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_3","type":"SelectorText","selector":"span:nth-of-type(7) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"image_2","type":"SelectorImage","selector":"li#thumbnail_253131 img.img-responsive","parentSelectors":["produit"],"multiple":false,"delay":0},{"id":"image_3","type":"SelectorImage","selector":"li.last img.img-responsive","parentSelectors":["produit"],"multiple":false,"delay":0}]}

Hello,

Everything seems correct when I scrape the website. The problem occurs when I export to CSV. It looks like the CSV file is not correct, or does not handle line breaks properly. I know I can just copy and paste the table from Web Scraper's browse panel into Excel, but I still want to be able to export to CSV from Web Scraper. Can you check that? It looks like the problem is in the "ensavoirplus" column. Am I doing something wrong?

Please try to scrape a part of the site and then export to Excel. I would like to attach the CSV to this message, but I think file upload is not allowed.

Thanks for your help.

This Chrome extension is really amazing, thanks for making it!

Regards

Error Message:

No error message

To access error messages follow these steps:

Open chrome://extensions/ or go to manage extensions
Enable “developer mode” at the top right
Open Web Scraper's “background page”
A new popup window should appear.
Go to “Console” tab. You should see Web Scraper log messages and errors there.

iconoclast · July 4, 2018, 9:01pm

Hi!

Sometimes the fields on a website are generated using JavaScript. They look okay on screen, and even when you press Preview data, but when scraped there is a lot of generated white space that is not visible on the website.

Example: (white space is not visible on screen)
<span class="c-delivery__when"> from 96 shops starting from 05.07, </span>

When scraped:

		from 96 shops starting from 05.07, 
	</span>

The white space is U+0009 : <control-0009> (CHARACTER TABULATION [TAB]) {horizontal tabulation (HT), tab}.

You have to check whether a CSS selector is set to a grouping div that contains JavaScript-generated values, and perhaps narrow it to a part of that div so that no white space is picked up.

Another way of dealing with the above-mentioned problem is to use a JavaScript extension (like Tampermonkey) to redraw the fields without white space.

P.S. You can also try using a regex within WebScraper to scrape only the text/numbers, avoiding white space at the beginning/end of the line. I use REGEXPAL for testing regex.
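A minimal sketch of that regex idea, written in Python only to try the pattern out (WebScraper itself runs JavaScript regular expressions, and whether it returns the full match for this pattern is an assumption to check in the extension). The pattern uses no flags: `[\s\S]` stands in for "any character, including line breaks", so the match runs from the first to the last non-white-space character. The sample string and the helper name `trim_with_regex` are made up for illustration.

```python
import re

# Flag-free pattern: first non-whitespace character through the last one.
# [\s\S] matches any character, newlines included, so no DOTALL flag is needed.
TRIM_PATTERN = r"\S(?:[\s\S]*\S)?"


def trim_with_regex(raw: str) -> str:
    """Return the text between the first and last non-whitespace characters."""
    match = re.search(TRIM_PATTERN, raw)
    return match.group(0) if match else ""


# Hypothetical scraped value with leading/trailing tabs and line breaks.
scraped = "\n\t\tfrom 96 shops starting from 05.07, \n\t"
print(repr(trim_with_regex(scraped)))
```

Note that this only trims the ends; line breaks in the middle of the value are kept.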

drspock · July 5, 2018, 5:01pm

Hello, and thanks for your answer. Do you have a regex on hand? I have been trying several patterns for more than two hours without success... How do you activate the regex multiline and global flags within Web Scraper?

iconoclast · July 5, 2018, 7:16pm

Unfortunately, RegEx flags are not supported by WebScraper yet. You have to either find a workaround regex (it can be a lengthy one) or use Tampermonkey to force the text fields to be redrawn without white space.

Please post some of the values here, I'll try to help you with regex.

drspock · July 6, 2018, 4:11pm

No luck with regex yet. I am trying with Tampermonkey, but it seems somewhat difficult for me. Two other questions: how would you scrape the links to the product images on the site https://electronicloisirs.com? And how much would you ask to scrape all the products of this site and send me the CSV file? The site has about 20000 products.

Thanks

iconoclast · July 6, 2018, 4:39pm

I'd help you for free, but the website gives me a 503 error each time I try to connect, even when I use a VPN.

If your scrape is that big, I would recommend trying out the Cloud WebScraper https://webscraper.io/service

Besides, as a workaround you can try to divide text selectors into parts that pick up exactly one word each (not a phrase like "available in store", but "available", "in", "store").

Another workaround is to edit the exported CSV file in Notepad++ and just replace the white space with a single space (this can be done with regex within Notepad++).

Please send me your CSV file so I can take a look.
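The same clean-up can also be scripted instead of done by hand. Below is a rough sketch in Python, assuming the exported file is valid CSV in which fields containing line breaks are quoted (if the export is not quoted that way, this will not repair it). The function names and the sample row are invented for illustration; only the column ids come from the sitemap above.

```python
import csv
import io
import re


def collapse_whitespace(value: str) -> str:
    """Replace every run of tabs, line breaks and spaces with one space."""
    return re.sub(r"\s+", " ", value).strip()


def clean_csv(src, dst) -> None:
    """Copy CSV rows from src to dst, collapsing white space in each field."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow([collapse_whitespace(cell) for cell in row])


# Small in-memory demo; the second field spans two lines, as in the broken export.
sample = io.StringIO(
    'titre,ensavoirplus\r\n"Lampe","Ligne 1\n\t\tLigne 2 "\r\n', newline=""
)
cleaned = io.StringIO()
clean_csv(sample, cleaned)
print(cleaned.getvalue())
```

For a real file, open both the input and the output with `newline=""` so that the `csv` module handles the line endings itself.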

drspock · July 9, 2018, 8:13am

Thanks for your answer. How can I send you the csv? I can only upload images here.

drspock · July 9, 2018, 8:47am

Also, I would like to scrape the link to each image of the product. On the site, the main image is shown when the visitor hovers over the thumbnail with the mouse (no click is needed). Basically, I would like to get the link highlighted in the picture I have attached to this message, for each image. Note that the product page does not always have 2 images (it can even have no image at all).

Thanks for your help. I will definitely consider using the cloud scraper tool once I am sure that everything will be scraped as I want. Do you have an idea of how long it would take to scrape 20000 products with the cloud scraper?

drspock · July 10, 2018, 10:01am

I have tried the method you suggested in the thread "Scrape all amazon products images?" to scrape the images.
Your answer in that thread seemed to work for the person who asked, but I still have no success... Is there anything I am doing wrong?
{"_id":"electronicloisirs","startUrl":["https://electronicloisirs.com/"],"selectors":[{"id":"categorie","type":"SelectorLink","selector":"ul.sf-menu > li.col-lg-2 > a.sf-with-ul","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"produit","type":"SelectorLink","selector":"div.right-block a.product-name","parentSelectors":["pagination"],"multiple":true,"delay":0},{"id":"titre","type":"SelectorText","selector":"h1","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"Reference","type":"SelectorText","selector":"p#product_reference span.editable","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"prix","type":"SelectorText","selector":"p.our_price_display span.price","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"ensavoirplus","type":"SelectorText","selector":"div.rte","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"div.bottom-pagination-content li:nth-of-type(n+2) a","parentSelectors":["categorie","pagination"],"multiple":true,"delay":0},{"id":"sous_cat_1","type":"SelectorText","selector":"span:nth-of-type(3) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_2","type":"SelectorText","selector":"span:nth-of-type(5) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_3","type":"SelectorText","selector":"span:nth-of-type(7) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"images","type":"SelectorElementClick","selector":"#view_full_size","parentSelectors":["produit"],"multiple":true,"delay":"2000","clickElementSelector":"#thumbs_list li","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"imageseule","type":"SelectorImage","selector":"img","parentSelectors":["images"],"multiple":false,"delay":"500"}]}

iconoclast · July 10, 2018, 11:10am

Hi,

I've managed to connect successfully through a proxy.
Images are loaded within 'Fancybox' once one of the images is clicked, and then changed when the 'Next' ('Right') button is clicked. I've created an Element Click selector for you that makes 'Fancybox' pop up, then clicks 'Right' for the next image. It will scrape all available images.

I'm sorry, I am at work and short on time.

Sitemap:
{"_id":"electronicloisirs-images","startUrl":["https://electronicloisirs.com/index.php?id_product=30104&controller=product"],"selectors":[{"id":"cliker","type":"SelectorElementClick","selector":"img.fancybox-image","parentSelectors":["_root"],"multiple":true,"delay":"1500","clickElementSelector":"span img, a.fancybox-nav.fancybox-next","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"images","type":"SelectorImage","selector":"_parent_","parentSelectors":["cliker"],"multiple":true,"delay":0}]}

drspock · July 17, 2018, 8:21am

Hello,

I am still working on this same site and I am trying to figure out why something is not working. I have noticed that the URLs contain the page numbers, so I decided to select pages by page number and scrape one category at a time. However, the scraping stops after some pages. In the example I am sending you, scraping stops at page 22 (counting from the end). Can you tell me why?

{"_id":"pag_electronicloisirs_video","startUrl":["https://electronicloisirs.com/index.php?id_category=1351&controller=category"],"selectors":[{"id":"produit","type":"SelectorLink","selector":"div.right-block a.product-name","parentSelectors":["pagination"],"multiple":true,"delay":0},{"id":"titre","type":"SelectorText","selector":"h1","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"Reference","type":"SelectorText","selector":"p#product_reference span.editable","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"prix","type":"SelectorText","selector":"p.our_price_display span.price","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"ensavoirplus","type":"SelectorText","selector":"div.rte","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"image","type":"SelectorImage","selector":"span img","parentSelectors":["produit"],"multiple":false,"delay":0},{"id":"pagination","type":"SelectorLink","selector":"div.bottom-pagination-content li:not(:last-child) a","parentSelectors":["_root","pagination"],"multiple":true,"delay":0},{"id":"sous_cat_1","type":"SelectorText","selector":"span:nth-of-type(3) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_2","type":"SelectorText","selector":"span:nth-of-type(5) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"sous_cat_3","type":"SelectorText","selector":"span:nth-of-type(7) span","parentSelectors":["produit"],"multiple":false,"regex":"","delay":0},{"id":"images","type":"SelectorLink","selector":"a.fancybox","parentSelectors":["produit"],"multiple":true,"delay":0},{"id":"ymages","type":"SelectorImage","selector":"img.fancybox-image","parentSelectors":["produit"],"multiple":false,"delay":0},{"id":"linkimges","type":"SelectorLink","selector":"a.fancybox","parentSelectors":["produit"],"multiple":true,"delay":"1000"}]}