How to use a Web Scraper sitemap in R or Python to scrape websites

Dear all,

I am pretty new to Web Scraper, so please excuse me if I have not read all the documentation and forum entries.

I have experimented with Web Scraper and have successfully scraped several websites with this great tool.

Now I would like to import the sitemaps generated in Web Scraper into R, Python, or another programming language, to automate the scraping of several websites in one run and to do some data manipulation before saving the scraped data to a database.

Have any of you successfully used Web Scraper sitemaps in R, Python, or another programming language to scrape websites? Are you aware of any packages for these languages that can import Web Scraper sitemaps?

Best regards

Alexander

Has nobody ever done this? I would really appreciate it if somebody could help.

Hi Alexander,

Did you ever figure out how to programmatically import a sitemap?

Thanks,
Rina

The document returned from the server is XML, which is transformed with XSLT into HTML (more info here). To parse all the links from this XML, you can use this Python script:

import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'

# Fetch the sitemap and parse it; lxml handles the XML markup here.
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each <url> entry in a sitemap contains a <loc> element with the page URL.
for loc in soup.select('url > loc'):
    print(loc.text)
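Regarding the original question: as far as I know, the sitemaps you export from the Web Scraper extension are plain JSON (with fields like `_id`, `startUrl`, and `selectors`), so you can load them in Python with the standard `json` module and drive your own scraping loop from the selectors. A minimal sketch, using a made-up inline sitemap rather than a real export:

```python
import json

# Hypothetical minimal Web Scraper sitemap export; real exports have the
# same top-level shape but typically contain many more selectors.
sitemap_json = """
{
  "_id": "example-sitemap",
  "startUrl": ["https://example.com/"],
  "selectors": [
    {"id": "title", "type": "SelectorText",
     "parentSelectors": ["_root"], "selector": "h1", "multiple": false}
  ]
}
"""

sitemap = json.loads(sitemap_json)

# Start URLs to crawl, straight from the sitemap.
start_urls = sitemap["startUrl"]

# Map each field name to the CSS selector Web Scraper would use for it;
# these selectors can then be fed to BeautifulSoup's select() on each page.
css_by_id = {s["id"]: s["selector"] for s in sitemap["selectors"]}

print(start_urls)
print(css_by_id)
```

From there you could loop over several exported sitemap files in one run, fetch each start URL, apply the selectors with BeautifulSoup, and write the results to your database.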