How to scrape duplicate links?

lustek · August 29, 2018, 2:04pm

I want to scrape products from tesco.com. Tesco lists the same product in multiple categories. Unfortunately, webscraper omits duplicates. How can I alter my sitemap so that it harvests all 27 results, duplicates included, rather than only the 20 unique ones?

{"_id":"tesco_kids","startUrl":["https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/all?count=48&department=Haircare&viewAll=department%2Caisle%2Cshelf&aisle=Shampoo&shelf=Kids%20Shampoo","https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/all?count=48&department=Haircare&viewAll=department%2Caisle%2Cshelf&aisle=Kids%20Haircare&shelf=Kids%20Shampoo"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":"li.pagination-btn-holder:nth-of-type(n+2) a.prev-next","multiple":false,"delay":""},{"id":"product","type":"SelectorElement","parentSelectors":["_root","pagination"],"selector":"li.product-list--list-item","multiple":true,"delay":"0"},{"id":"link","type":"SelectorLink","parentSelectors":["product"],"selector":"a.product-tile--title","multiple":false,"delay":"2000"},{"id":"pagination page","type":"SelectorText","parentSelectors":["pagination"],"selector":"a.pagination--button.disabled.highlight span:nth-of-type(1)","multiple":false,"regex":"","delay":0}]}

iconoclast · August 29, 2018, 9:13pm

Hi!

WebScraper skips links that are identical to ones it has already seen in your sitemap. The only workaround (a typical example is a product that comes in two colors under the same link) is either to use an encoded URL (you've already done that) or to add an extra parameter to the URL, such as 'grid=8' or anything similar that makes the link unique, provided the website supports it.
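
As a rough illustration of the "make the link unique" idea, the sketch below appends an extra query parameter to a URL. The helper name `withUniqueParam`, the parameter name `src`, and the example.com address are all made up for this example; whether a given site ignores unknown parameters has to be checked per site.

```javascript
// Hypothetical helper: add a throwaway query parameter so that two otherwise
// identical links become distinct strings for the scraper's duplicate check.
function withUniqueParam(href, name, value) {
  const url = new URL(href);          // standard WHATWG URL API
  url.searchParams.set(name, value);  // adds or overwrites ?name=value
  return url.toString();
}

// Placeholder product link, tagged once per category it was found in.
const link = "https://example.com/products/123";
const fromShampoo = withUniqueParam(link, "src", "shampoo");
const fromKids = withUniqueParam(link, "src", "kids-haircare");
console.log(fromShampoo !== fromKids); // the two links now differ
```

This only shows the string manipulation; the links on the page itself would still have to be rewritten before the scraper reads them.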

Do you mean that the same products exist in different categories? Did you check whether the links to them are the same as well?

Your sitemap seems to be invalid because of an improperly posted RegEx. Please re-add your sitemap using the Preformatted text button (shortcut: Ctrl+Shift+C).

I've done a test scrape using a quickly made sitemap, and it displays all 238 results (I used 'shampoo' as the search word).

Please try this one out:

 {"_id":"tesco_test","startUrl":["https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/haircare/shampoo/all?page=[1-5]"],"selectors":[{"id":"grouping","type":"SelectorElement","selector":"div.tile-content","parentSelectors":["_root"],"multiple":true,"delay":"1000"},{"id":"Text","type":"SelectorText","selector":"a.product-tile--title","parentSelectors":["grouping"],"multiple":false,"regex":"","delay":0}]}

lustek · August 30, 2018, 7:58am

Hi, I updated my sitemap as preformatted text.

Yes, the same products exist in different categories, and I need to know all of them, not only the first one.

In my case I have 2 different categories with the same products:

Health & Beauty > Haircare > Shampoo > Kids Shampoo
https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/all?count=48&department=Haircare&viewAll=department%2Caisle%2Cshelf&aisle=Shampoo&shelf=Kids%20Shampoo
Health & Beauty > Haircare > Kids haircare > Kids Shampoo
https://www.tesco.com/groceries/en-GB/shop/health-and-beauty/all?count=48&department=Haircare&viewAll=department%2Caisle%2Cshelf&aisle=Kids%20Haircare&shelf=Kids%20Shampoo

lustek · August 30, 2018, 8:42am

OK, I managed to force link uniqueness with a Tampermonkey script:

// ==UserScript==
// @name         Tesco unique links
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  try to take over the world!
// @author       Lukasz Stachowicz
// @include      https://www.tesco.com/*
// @grant        none
// ==/UserScript==



(function () {
    'use strict';
    // Append the listing page URL as a fragment so each link is unique per category.
    var anchor = document.getElementsByClassName("product-tile--title");
    for (var i = 0; i < anchor.length; i++) {
        anchor[i].href = anchor[i].href + "#from=" + window.location.href;
    }
})();
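
Since the scraped links now carry a "#from=" suffix, the exported data needs a small clean-up step afterwards. A minimal sketch of that step, assuming the links were tagged exactly as in the userscript above; the helper name `splitTaggedLink` and the example.com addresses are placeholders, not part of the original script:

```javascript
// Hypothetical post-processing helper for links tagged by the userscript.
// Splits "<product url>#from=<listing url>" back into its two parts.
function splitTaggedLink(link) {
  const marker = "#from=";
  const at = link.indexOf(marker);
  if (at === -1) {
    return { productUrl: link, fromUrl: null }; // link was never tagged
  }
  return {
    productUrl: link.slice(0, at),
    fromUrl: link.slice(at + marker.length),
  };
}

const parts = splitTaggedLink(
  "https://example.com/products/123#from=https://example.com/shop/all?shelf=Kids%20Shampoo"
);
console.log(parts.productUrl); // the clean product link
console.log(parts.fromUrl);    // the category page the link was found on
```

The `fromUrl` part is what tells you which category each duplicate row came from.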

Drusez · October 9, 2019, 10:43am

@lustek

Thank you for your answer :), interesting to see other people have the same problem.

Could you elaborate a bit though, as to how this tampermonkey extension works? I downloaded it into Google Chrome and copied your code into it, but it still doesn't seem to work. When I click on the icon at the top right, it says "no script is running", but when I view the dashboard, it says enabled.

Could you please shed some light on what I'm doing wrong?

Thank you so much!

lustek · October 9, 2019, 11:18am

I created my userscript to work only on tesco.com. On page load it scans for the product links on the page and appends a #from anchor to the end of each one, to force the uniqueness required by webscraper.

Change this line to the domain you want to scrape:

// @include      https://www.tesco.com/*

When you create a sitemap in webscraper, remember to set proper delays. The userscript needs time to execute before scraping begins.

Drusez · October 9, 2019, 1:31pm

Thank you so much for getting back to me lustek, even after the topic was dormant for a year!

It still doesn't work though. It's actually ASDA's site, not Tesco's, and the structure is completely different: I have to use an Element Scroll Down selector rather than Element. I thought I had changed the "getElementsByClassName" bit of code accordingly, but it's still not right. I suspect it's this "getElements" bit I also have to change, but I'm not sure what to.

Would you mind taking a quick look at my code? It will probably only take you a minute (I spent almost the whole morning on it xD):

Thank you so much!

Tampermonkey:

// ==UserScript==
// @name New Userscript
// @namespace https://groceries.asda.com
// @version 0.1
// @description try to take over the world!
// @author You
// @include https://groceries.asda.com/
// @grant none
// ==/UserScript==

(function (){
  'use strict';
var anchor=document.getElementsByClassName("#listingsContainer div.product-content");
for(var i = 0; i < anchor.length; i++){
    anchor[i].href=anchor[i].href + "#from=" +window.location.href ;
};

})()

ASDA Site Map:
{"id":"atp-asda","startUrl":["https://groceries.asda.com/aisle/vegetarian-free-from/vegetarian//103590"],"selectors":[{"id":"section","type":"SelectorLink","parentSelectors":["_root"],"selector":".noChild a","multiple":true,"delay":0},{"id":"scroll","type":"SelectorElementScroll","parentSelectors":["section"],"selector":"#listingsContainer div.product-content","multiple":true,"delay":"1000"},{"id":"item","type":"SelectorLink","parentSelectors":["scroll"],"selector":"a.line-clamp","multiple":false,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["item"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"price","type":"SelectorText","parentSelectors":["item"],"selector":"span.prod-price-inner","multiple":false,"regex":"","delay":0},{"id":"productcode","type":"SelectorText","parentSelectors":["item"],"selector":".prod-code span","multiple":false,"regex":"","delay":0},{"id":"weight","type":"SelectorText","parentSelectors":["item"],"selector":"span.weight","multiple":false,"regex":"","delay":0},{"id":"allergy","type":"SelectorText","parentSelectors":["item"],"selector":"strong p","multiple":false,"regex":"","delay":0},{"id":"ingredients","type":"SelectorText","parentSelectors":["item"],"selector":".product-description p:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"nutrition","type":"SelectorText","parentSelectors":["item"],"selector":"div.nv-table","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["item"],"selector":"span:nth-of-type(1) p","multiple":false,"regex":"","delay":0}]}

lustek · October 10, 2019, 11:40pm

This is a userscript for ASDA. In my last message I forgot to mention that the anchor class has to be changed, but you figured it out.

As the product listing loads asynchronously, I needed to add a 2s delay here. The delay in your webscraper sitemap has to be even longer, to wait for the userscript.

// ==UserScript==
// @name         asda unique links
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  try to take over the world!
// @author       Lukasz Stachowicz
// @include      https://groceries.asda.com/*
// @grant        none
// ==/UserScript==

setTimeout(changeAnchor, 2000);

function changeAnchor(){

    var anchor=document.getElementsByClassName("co-product__anchor");
    for(var i = 0; i < anchor.length; i++){
        anchor[i].href=anchor[i].href + "#from=" +window.location.href ;
    };
};
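
A fixed 2s timeout can miss items that load later, for example after scrolling. One possible alternative, which is not what the script above does, is to re-run the tagging whenever new nodes appear, using the browser's standard `MutationObserver`. This is only a sketch: the function name `tagAnchors` is made up, the class name is taken from the script above, and it assumes the userscript runs after `document.body` exists.

```javascript
// Tag every anchor with the page it was found on, skipping anchors that were
// already tagged so repeated runs do not append "#from=" twice.
function tagAnchors(anchors, pageUrl) {
  let changed = 0;
  for (const a of anchors) {
    if (!a.href.includes("#from=")) {
      a.href = a.href + "#from=" + pageUrl;
      changed++;
    }
  }
  return changed;
}

// Browser-only part: tag what is there now, then again whenever nodes are added.
if (typeof document !== "undefined" && typeof MutationObserver !== "undefined") {
  const run = () =>
    tagAnchors(
      document.getElementsByClassName("co-product__anchor"),
      window.location.href
    );
  run();
  // Only child-list changes are observed, so rewriting href attributes
  // does not re-trigger the observer.
  new MutationObserver(run).observe(document.body, { childList: true, subtree: true });
}
```

The delay in the webscraper sitemap is still needed either way, because the scraper has to read the links after they have been rewritten.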

Drusez · October 11, 2019, 7:29am

This is amazing, thank you so much for your help lustek!

lockdown2020 · February 14, 2021, 11:56pm

// ==UserScript==
// @name         [scriptname]
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  try to take over the world!
// @author       Lukasz Stachowicz
// @include      [insert homepage URL]*
// @grant        none
// ==/UserScript==


setTimeout(changeAnchor, 5000);

function changeAnchor() {
    'use strict';
    var anchor = document.querySelectorAll("[insert your classes here]");
    //console.log(anchor);
    for (var i = 0; i < anchor.length; i++) {
        anchor[i].href = anchor[i].href + "#from=" + i;
        //console.log(anchor[i].href + "#from=" + i);
        //console.log(anchor[i].href);
    }
}

I had trouble with this too. I posted the problem but managed to resolve it.

I plugged the above script into Tampermonkey. Use console.log() for troubleshooting when adapting it to individual sites. Note that the timer may need to be set higher depending on site latency. Make sure webscraper's delay is longer than the script's.
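
For the console troubleshooting mentioned above, a small check like the following may help to confirm whether a page still contains identical links after the userscript has run. It is only a sketch: `findDuplicateHrefs` is a made-up name, and the selector in the comment is a placeholder for your own.

```javascript
// Hypothetical troubleshooting helper: count how often each href occurs and
// return only the ones that appear more than once, as [href, count] pairs.
function findDuplicateHrefs(hrefs) {
  const counts = new Map();
  for (const href of hrefs) {
    counts.set(href, (counts.get(href) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1);
}

// In the browser console, with your own selector in place of the placeholder:
// findDuplicateHrefs(Array.from(document.querySelectorAll("a.your-class"), a => a.href));
```

An empty result means every link on that page is unique. It says nothing about links repeated across different pages.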

Good luck.

Jaxaay_Annuaire · June 24, 2021, 7:05pm

Hello, I have a similar case, even though my sitemap is simple, with a single page link, on Rechercher une société | Bureau d'appui à la Création d'Entreprise

Sitemap :
{"_id":"bce-btp2020","startUrl":["Rechercher une société | Bureau d'appui à la Création d'Entreprise div.field-item","multiple":false,"regex":"","delay":0},{"id":"registre_de_commerce","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-rc-societe div.field-item","multiple":false,"regex":"","delay":0},{"id":"ninea","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-ninea-societe div.field-item","multiple":false,"regex":"","delay":0},{"id":"date_de_creation","type":"SelectorText","parentSelectors":["link"],"selector":"span.date-display-single","multiple":false,"regex":"","delay":0},{"id":"localite","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-localite div.field-item","multiple":false,"regex":"","delay":0},{"id":"gerance","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-gerance-societe div.field-item","multiple":false,"regex":"","delay":0},{"id":"secteur_dactivite","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-secteur div.field-item","multiple":false,"regex":"","delay":0},{"id":"forme_juridique","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-forme-juriduqe div.field-item","multiple":false,"regex":"","delay":0},{"id":"objet_social","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-objet-societe div.field-item","multiple":false,"regex":"","delay":0},{"id":"exercice_social","type":"SelectorText","parentSelectors":["link"],"selector":".field-name-field-exercice-societe div.field-item","multiple":false,"regex":"","delay":0},{"id":"article","type":"SelectorElement","parentSelectors":["link"],"selector":"div#bce_contentnews","multiple":true,"delay":0},{"id":"link","type":"SelectorLink","parentSelectors":["_root"],"selector":".views-field-title a","multiple":true,"delay":0}]}

The parameter &page=[0-1] has been added to the Start URL to scrape pages 0 to 1 (16 lines in total).
Normally I should have 15 lines in the CSV, but only 14 lines are exported.
There is a duplicate line that Web Scraper seems to skip (see capture).

So how do you get around duplicates and export them only once?