How to use a Web Scraper sitemap in R or Python to scrape websites

Dear all,

I am pretty new to Web Scraper, so please excuse me if I have not read all the documentation and forum entries.

I have experimented with Web Scraper and have successfully scraped several websites with this great tool.

Now I would like to import the sitemaps generated in Web Scraper into R, Python, or another programming language, to automate the scraping of several websites in one run and to do some data manipulation before saving the scraped data to a database.

Have any of you successfully used Web Scraper sitemaps in R, Python, or another programming language to scrape websites? Are you aware of any packages for these languages that can import Web Scraper sitemaps?

Best regards

Alexander

Has nobody ever done this? I would really appreciate it if somebody could help.

Hi Alexander,

Did you ever figure out how to programmatically import a sitemap?

Thanks,
Rina

The document returned from the server is XML, which is transformed with XSLT into HTML (more info here). To parse all the links from this XML, you can use this Python script:

import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'

# Fetch the sitemap and parse it; lxml handles the XML markup here.
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each <url> entry in a sitemap contains a <loc> element with the page URL.
for loc in soup.select('url > loc'):
    print(loc.text)
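Regarding the original question: as far as I know, the sitemaps you export from the Web Scraper extension are plain JSON (with fields like `_id`, `startUrl`, and `selectors`), so you can load them in Python with the standard `json` module and drive your own scraping loop from the selectors. A minimal sketch, using a made-up inline sitemap rather than a real export:

```python
import json

# Hypothetical minimal Web Scraper sitemap export; real exports have the
# same top-level shape but typically contain many more selectors.
sitemap_json = """
{
  "_id": "example-sitemap",
  "startUrl": ["https://example.com/"],
  "selectors": [
    {"id": "title", "type": "SelectorText",
     "parentSelectors": ["_root"], "selector": "h1", "multiple": false}
  ]
}
"""

sitemap = json.loads(sitemap_json)

# Start URLs to crawl, straight from the sitemap.
start_urls = sitemap["startUrl"]

# Map each field name to the CSS selector Web Scraper would use for it;
# these selectors can then be fed to BeautifulSoup's select() on each page.
css_by_id = {s["id"]: s["selector"] for s in sitemap["selectors"]}

print(start_urls)
print(css_by_id)
```

From there you could loop over several exported sitemap files in one run, fetch each start URL, apply the selectors with BeautifulSoup, and write the results to your database.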