Link formatting

Hello all,

I've got a problem with link formatting that makes my CSV file messy.
I set up a sitemap to pick up product details such as description, price, ref, EAN... It works perfectly well, but when I export to CSV, it's a mess because of new lines, carriage returns or other special characters in my 'range' selector.

I'm testing with the first page before going through all the others:
URL: https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&p=1

Sitemap:
{"_id":"lampesdirect_ledvance","startUrl":["https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&p=1"],"selectors":[{"id":"range","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.result","multiple":true,"delay":0},{"id":"subrange","type":"SelectorLink","parentSelectors":["range"],"selector":"[itemprop='name'] a","multiple":true,"delay":0},{"id":"p_description","type":"SelectorText","parentSelectors":["subrange"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"p_price","type":"SelectorText","parentSelectors":["subrange"],"selector":".current-price div","multiple":false,"regex":"[0-9]+\,[0-9]+","delay":0},{"id":"p_reference","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('Réf.') td.data","multiple":false,"regex":"","delay":0},{"id":"p_manuf","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('Nom du fabricant') td.data","multiple":false,"regex":"","delay":0},{"id":"p_ean","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('EAN') td.data","multiple":false,"regex":"","delay":0},{"id":"p_picture","type":"SelectorImage","parentSelectors":["subrange"],"selector":"img[itemprop='image']","multiple":false,"delay":0}]}

When I preview the 'range' data, I've got:

When I download the CSV, I've got this:

Ideally, I'd like to have the string before 'À partir de:'

I'm stuck with this one.
Any help will be really appreciated.

Thank you
David

I can't import your sitemap; I get "invalid JSON". Anyway, from the results it looks like 'range' has the Multiple option checked.
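One possible cause of the "invalid JSON" message (an assumption, not something confirmed in this thread): the p_price regex in the pasted sitemap contains `\,`, and JSON does not define a backslash-comma escape. Forum software can also alter backslashes and quotes when a sitemap is pasted as plain text. Below is a minimal Python sketch for checking a pasted sitemap before importing it. The shortened sitemap fragments and the `check_sitemap` helper are illustrative only.

```python
import json

def check_sitemap(text):
    """Return (True, None) if text parses as JSON, else (False, error message)."""
    try:
        json.loads(text)
        return True, None
    except json.JSONDecodeError as err:
        return False, f"{err.msg} at character {err.pos}"

# Shortened, illustrative fragments of the p_price selector.
# Raw strings keep the backslashes exactly as they appear in the pasted text.
as_pasted = r'{"id":"p_price","regex":"[0-9]+\,[0-9]+"}'
escaped = r'{"id":"p_price","regex":"[0-9]+\\,[0-9]+"}'
no_escape = r'{"id":"p_price","regex":"[0-9]+,[0-9]+"}'

for label, text in [("as pasted", as_pasted), ("escaped", escaped), ("no escape", no_escape)]:
    ok, error = check_sitemap(text)
    print(label, "->", "valid" if ok else f"invalid: {error}")
```

A comma has no special meaning in a regex, so the variant without a backslash should match the same prices.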

Thank you leemeng, but I think there is a misunderstanding / a bad explanation of my problem.

Sitemap: I imported it into Web Scraper and there is no error with this one.
{"_id":"lampesdirect_ledvance","startUrl":["https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&p=1"],"selectors":[{"id":"range","type":"SelectorLink","parentSelectors":["_root"],"selector":"div:nth-of-type(n+5) a[itemprop='url']","multiple":true,"delay":0},{"id":"subrange","type":"SelectorLink","parentSelectors":["range"],"selector":"[itemprop='name'] a","multiple":true,"delay":0},{"id":"p_description","type":"SelectorText","parentSelectors":["subrange"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"p_price","type":"SelectorText","parentSelectors":["subrange"],"selector":".current-price div","multiple":false,"regex":"[0-9]+\,[0-9]+","delay":0},{"id":"p_reference","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('Réf.') td.data","multiple":false,"regex":"","delay":0},{"id":"p_manuf","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('Nom du fabricant') td.data","multiple":false,"regex":"","delay":0},{"id":"p_ean","type":"SelectorText","parentSelectors":["subrange"],"selector":"tr:contains('EAN') td.data","multiple":false,"regex":"","delay":0},{"id":"p_picture","type":"SelectorImage","parentSelectors":["subrange"],"selector":"img[itemprop='image']","multiple":false,"delay":0}]}

The selector I call 'range' is actually a link (red frame) to a page of products, which is why I used a Link selector and checked 'Multiple'.

Data preview looks fine, but CSV is messy.


On a Mac, the CSV is well formatted. I mean the 'range' value is in one single cell and doesn't mess up the other data, so I can use the data that is relevant to me: p_reference, p_manuf, p_ean and p_price.

I don't think it's possible to get rid of the carriage returns, as there is no regex option for the Link selector. Is there a way to get rid of the unneeded data when downloading the CSV?

Thank you
David
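If the export itself cannot be changed, the CSV can be cleaned afterwards. Here is a minimal Python sketch, assuming the column is named 'range' as in the sitemap above and that only the text before 'À partir de:' is wanted. The sample row and the helper names `clean_cell` and `clean_csv` are made up for illustration.

```python
import csv
import io
import re

MARKER = "À partir de:"  # everything from this text onward is dropped

def clean_cell(value, marker=MARKER):
    """Keep the text before marker and collapse whitespace runs to single spaces."""
    head = value.split(marker, 1)[0]
    return re.sub(r"\s+", " ", head).strip()

def clean_csv(src, dst, column="range"):
    """Copy CSV rows from src to dst, cleaning one column on the way."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = clean_cell(row[column])
        writer.writerow(row)

# Illustrative data only: one quoted cell that contains line breaks.
sample = 'range,p_reference\n"Lampe LED\r\n   À partir de:\n 3,99 €",229341\n'
out = io.StringIO()
clean_csv(io.StringIO(sample, newline=""), out)
print(out.getvalue())
```

For a real export, the same function can be given files opened with `newline=""`, which is the mode the csv module documentation recommends so that line breaks inside quoted cells are read correctly.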

I still can't import your sitemap, but I think I see the problem. I generated a new sitemap from scratch and got the same large blocks of text and blank spaces as you did. The initial selector I used was a.result, which turns out to be overly broad and grabs a lot of text. I changed it to div.result-sub-content > h3, and it should now produce the results you wanted. Please modify as needed:

{"_id":"lampesdirect_only_links","startUrl":["https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&p=1"],"selectors":[{"id":"Item links to click","type":"SelectorLink","parentSelectors":["_root"],"selector":"div.result-sub-content > h3","multiple":true,"delay":0}]}

Hi all, I wonder why you would go about it in this complicated way, because you could simply get the sitemap you need from an XML file provided by the site itself and listed in robots.txt:
https://www.lampesdirect.fr/robots.txt

If you go to
https://www.lampesdirect.fr/sitemap_lampesdirect_fr.xml
you will find a list of all the pages, including the catalog pages (a couple hundred of them, after which it lists the product pages).
Then you could just parse the individual product pages, which would be a lot easier, and each one has a unique ID such as Réf. 229341.
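As a rough illustration of this approach, the Python sketch below pulls every `<loc>` URL out of a sitemap document and wraps the list in a Web Scraper sitemap skeleton, using the `_id` / `startUrl` / `selectors` layout seen earlier in this thread. The sample XML, the product path, the `_id` value and the function names are assumptions; the real file would have to be downloaded first.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS) if loc.text]

def start_url_sitemap(urls, sitemap_id):
    """Build a Web Scraper sitemap skeleton with the given start URLs and no selectors."""
    return json.dumps({"_id": sitemap_id, "startUrl": urls, "selectors": []})

# Illustrative document; the second path is a placeholder, not a real product page.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&amp;p=1</loc></url>
  <url><loc>https://www.lampesdirect.fr/example-product</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
print(start_url_sitemap(urls, "lampesdirect_products"))
```

The product selectors (p_reference, p_ean and so on) would then hang directly off `_root`, since each start URL is already a product page.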

Thanks for the info on robots.txt and sitemaps. I've been meaning to explore more on those methods. This particular problem is more of a selector issue and not a navigation issue. But the sitemap could be useful too.

Hello leemeng,

I did try div.result-sub-content > h3 as a Link selector, and yes, it does remove the unexpected characters, but the scrape ends because there is no href value to follow.

I found a workaround that is not the ideal solution.
Thank you very much for your help.

David
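A Link selector needs an element that carries an href, which a bare h3 does not. If the heading wraps an anchor, a selector such as div.result-sub-content > h3 a might give both clean text and a link to follow, but that is an untested assumption about the page. The Python sketch below only illustrates the idea on hypothetical markup that was not copied from the site.

```python
from html.parser import HTMLParser

class HeadingLinks(HTMLParser):
    """Collect href values of <a> tags that sit inside <h3> headings."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_depth += 1
        elif tag == "a" and self.h3_depth:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_depth:
            self.h3_depth -= 1

# Hypothetical markup: the h3 has no href of its own, the nested <a> does.
sample = (
    '<div class="result-sub-content">'
    '<h3><a href="/example-range">Example range</a></h3>'
    '<a href="/outside-heading">ignored</a>'
    "</div>"
)
parser = HeadingLinks()
parser.feed(sample)
print(parser.hrefs)
```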

Hello Zerofirefox,

Thank you for sharing this, but I'm wondering how you found the robots.txt and sitemap details.
I have other websites to scrape, and finding this kind of information might help me a lot.
I'll have a look at this sitemap and how to exploit it.

David

Hi David, that was merely a demo scraper I made from scratch (no child elements). You would need to modify the selector on your original sitemap.

Bretfeig did a video on these topics, and he provides a step-by-step guide on how to find a sitemap and how to use it with Web Scraper (WS).
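For other sites, one common starting point is the robots.txt file at the site root, where sitemaps are often announced on lines beginning with "Sitemap:". Not every site lists one. Below is a minimal Python sketch that extracts those lines from robots.txt text; the sample content (including the Disallow path) and the function name are illustrative.

```python
def sitemaps_in_robots(robots_text):
    """Return the URLs from all 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the URL's own colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# Illustrative robots.txt content.
sample = """User-agent: *
Disallow: /checkout/
Sitemap: https://www.lampesdirect.fr/sitemap_lampesdirect_fr.xml
"""
print(sitemaps_in_robots(sample))
```

Python's standard library also offers `urllib.robotparser.RobotFileParser`, whose `site_maps()` method (Python 3.8 and later) returns the same information after the file has been fetched with `read()`.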

Thanks leemeng, your help is much appreciated!