Scrape a website that redirects to another website

Hello,
I am trying to scrape this site because its content has been unavailable since they shut it down (journalists do not even have access to their own articles). Is there any way around the redirect using this tool? Thank you for your help!

Url: http://dcist.com

Sitemap:
{"_id":"DCist","startUrl":["http://www.dcist.com"],"selectors":[{"id":"article-title-text","multiple":false,"parentSelectors":["_root"],"regex":"","selector":".post-title","type":"SelectorText"},{"id":"author-text","multiple":false,"parentSelectors":["_root"],"regex":"","selector":".post-author-twitter a","type":"SelectorText"},{"id":"article-date-text","multiple":false,"parentSelectors":["_root"],"regex":"","selector":".story-meta span.post-timestamp","type":"SelectorText"},{"id":"article-text","multiple":false,"parentSelectors":["_root"],"regex":"","selector":".story-content","type":"SelectorText"}]}

Sorry to hear about your situation. This site's XML sitemaps (not the same as Web Scraper sitemaps) still seem to be available, and those provide direct links to the articles:
https://dcist.com/sitemap-index-1.xml
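
If you would rather pull the article links out of a downloaded sitemap file yourself, here is a minimal Python sketch. It assumes the files follow the standard sitemaps.org format (links inside `<loc>` elements), which I have not verified for this site. The `extract_locs` name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return every <loc> URL found in a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample standing in for the contents of a downloaded sitemap file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2017/01/first-post.php</loc></url>
  <url><loc>https://example.com/2017/01/second-post.php</loc></url>
</urlset>"""

print(extract_locs(SAMPLE))
```

The same function should work on a sitemap index as well, since index files also list their child sitemaps in `<loc>` elements.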

Here's an example sitemap that scrapes from sitemap-35 (limited to the first 8 pages for testing):

{"_id":"dcist-test","startUrl":["https://dcist.com/sitemap-35.xml"],"selectors":[{"id":"Click links","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"tbody > tr:nth-of-type(-n+8) td a","type":"SelectorLink"},{"id":"Title","multiple":false,"parentSelectors":["Click links"],"regex":"","selector":".post-title span","type":"SelectorText"},{"id":"Date","multiple":false,"parentSelectors":["Click links"],"regex":"","selector":".story-meta p.post-timeslug","type":"SelectorText"},{"id":"Byline","multiple":false,"parentSelectors":["Click links"],"regex":"","selector":"div.post-author-twitter","type":"SelectorText"},{"id":"Content","multiple":false,"parentSelectors":["Click links"],"regex":"","selector":"div.story-content","type":"SelectorText"}]}

Tested with Page load delay: 3500

To remove the page limit, edit the "Click links" selector and delete the nth-of-type bit, leaving only:
tbody > tr td a
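
In case the `:nth-of-type(-n+8)` part looks cryptic: the formula `an+b` is evaluated for n = 0, 1, 2, ... and a row matches if its 1-based position equals one of the results. With `-n+8` that gives 8, 7, 6, ... 1, so only the first 8 rows match. Here is a small Python sketch of that arithmetic (the `nth_matches` helper is hypothetical, written only to illustrate the rule):

```python
def nth_matches(a, b, count):
    """Return the 1-based positions among `count` siblings matched by the CSS formula an+b."""
    matched = []
    for position in range(1, count + 1):
        if a == 0:
            # With a == 0 the formula is just the constant b.
            hit = position == b
        else:
            # position = a*n + b must hold for some integer n >= 0.
            n, remainder = divmod(position - b, a)
            hit = remainder == 0 and n >= 0
        if hit:
            matched.append(position)
    return matched


# -n+8 applied to a table with 12 rows.
print(nth_matches(-1, 8, 12))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

So to test with a different number of pages, change the 8 to whatever limit you want.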

Also check out wget, a command-line downloader, as another way to fetch the pages in bulk.
