Google News sitemap?


Noob here, would appreciate some help.

I'm trying to scrape the search results of a Google News search.

I just need the text of the link and the link itself.

When I view the source of the page I see that the classes have random names.

Does anyone have any experience scraping Google News? Has anyone here succeeded or failed at it?

Here is an example search


Yeah, the random class names are probably a kind of anti-scraping measure. You'll need to look at the source and find more general selectors that don't depend on them. Try this:

{"_id":"goog-news-test","startUrl":[""],"selectors":[{"id":"story-and-link wrappers","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.g a[ping]a:not(.top):not(\":contains('View all')\")","multiple":true,"delay":0},{"id":"Story","type":"SelectorText","parentSelectors":["story-and-link wrappers"],"selector":"_parent_","multiple":false,"regex":"","delay":0},{"id":"Link","type":"SelectorLink","parentSelectors":["story-and-link wrappers"],"selector":"_parent_","multiple":false,"delay":0}]}

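The "story-and-link wrappers" selector uses a couple of :not() selectors to weed out image links and "View all" links. The Story and Link selectors then just read the text and the URL from each matched link via _parent_.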
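
If it helps to see what that selector is filtering for outside of the extension, here is a rough Python sketch of the same idea using only the standard library. It is not what the extension runs internally, and the sample HTML, the `LinkExtractor` class, and the `extract_links` function are all made up for illustration; the real Google News markup may well differ. It keeps links that sit inside a div with class "g" and have a ping attribute, and skips links with class "top" or text containing "View all".

```python
from html.parser import HTMLParser

# Hypothetical sample markup, only meant to mimic the structure the selector targets.
SAMPLE_HTML = """
<div class="g">
  <a class="top" ping="/p1" href="https://example.com/a"><img src="x.png"></a>
  <a ping="/p2" href="https://example.com/a">First story</a>
  <a ping="/p3" href="https://example.com/all">View all</a>
</div>
<div class="other"><a ping="/p4" href="https://example.com/b">Outside</a></div>
<div class="g"><div><a ping="/p5" href="https://example.com/c">Second story</a></div>
  <a href="https://example.com/d">No ping</a></div>
"""


class LinkExtractor(HTMLParser):
    """Collects (text, href) pairs for links matching the filter described above."""

    def __init__(self):
        super().__init__()
        self.g_depth = 0      # > 0 while we are inside a <div class="g">
        self.current = None   # the link currently being captured, if any
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self.g_depth > 0:
                self.g_depth += 1          # nested div inside div.g
            elif "g" in classes:
                self.g_depth = 1           # entering div.g
        elif tag == "a":
            # Equivalent of: div.g a[ping]:not(.top)
            if self.g_depth > 0 and "ping" in attrs and "top" not in classes:
                self.current = {"href": attrs.get("href"), "text": []}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.g_depth > 0:
            self.g_depth -= 1
        elif tag == "a" and self.current is not None:
            text = "".join(self.current["text"]).strip()
            # Equivalent of: :not(:contains('View all'))
            if "View all" not in text:
                self.results.append((text, self.current["href"]))
            self.current = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.results


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

The point is just that the filter relies on structure and attributes (a container class, the ping attribute, the link text) rather than on the randomly named classes, so it should keep working when those names change.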