Scraping from an XML Sitemap

Hi All

I have managed to scrape a site, but it takes hours because it has to find all the product page URLs first in a very convoluted way. There is an XML sitemap that lists all the product page URLs, but can Web Scraper extract these URLs from an XML sitemap? Clicking on "Select" doesn't work, so I wondered if there was a way to say: search for <loc>, take the text that follows until it finds </loc> (that text is the URL), then move on to the next <loc>, etc.
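
In other words, the logic I'm after is something like this little Python sketch (the sitemap URL is just a placeholder, not the real site):

# Minimal sketch: pull every <loc> value out of an XML sitemap.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://url.com/sitemap.xml"  # placeholder sitemap location

with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())

# Sitemap elements carry the sitemaps.org namespace, so match on the local tag name.
urls = [el.text.strip() for el in root.iter() if el.tag.split("}")[-1] == "loc" and el.text]
print("\n".join(urls))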

Is this possible or beyond the limitations of this program?

Thanks

Nick

Please help 🙂

I ended up manually creating a sitemap and importing it.
Just take the page URLs from your XML sitemap and insert them into something like this:

{"_id":"empire-001","startUrl":["http://url.com/1/","http://url.com/2/","http://url.com/3/","http://url.com/4/","http://url.com/5/","http://url.com/6/"}"],"selectors":[{"id":"title","type":"SelectorText","selector":"div.stage-caption p","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"content","type":"SelectorHTML","selector":"div.note-izq","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"image","type":"SelectorImage","selector":"div.img-cont img","parentSelectors":["_root"],"multiple":false,"delay":0}]}

Hi hjbarraza, what did you do with this JSON? Is this the code you put into webscraper?

Okay, I think I understand what you mean: basically, grab all the links manually out of the XML and scrape them. The problem then is that if the XML sitemap updates, we'll be missing links. I'm wondering if embedding XML into HTML is possible. Googling...

A quick and dirty solution would be to:

  1. Create a Google Sheet
  2. Use IMPORTXML to import the URLs from the sitemap (see the example formula after this list)
  3. Publish the sheet as a web page
  4. Use the published Google Sheet web page as the root of your scraper
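
For step 2, the IMPORTXML formula looks something like this (the sitemap URL is a placeholder; if a plain //loc XPath returns nothing because of the sitemap's namespace, the local-name() form below usually does):

=IMPORTXML("http://url.com/sitemap.xml", "//*[local-name()='loc']")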

Google Sheet:

Published as web page:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSHgzkngGy-D3cuneYs1u0FLf61BSk_bEwa_w3EP-TPuVdxpRvqyONyXFyl5ZzIXjptKk4sgGwbkyec/pubhtml?gid=0&single=true


Can you give an example of step 4?

@davem I am trying to use a Google Sheet as the root of my scraper, but I've got some issues... maybe you can help me?
The Google Sheet has a column with several URLs, one per row.
I would like the scraper to visit all of the URLs and scrape data from each page, but the Link selector is not working... I tried the Pagination selector and it is not working either. How do you set up the selectors so that the scraper visits all the URLs contained in a Google Sheet?