Google News sitemap?

hananc · April 1, 2019, 11:46am

Hello,

Noob here, would appreciate some help.

I am trying to scrape the results of a Google News search.

I just need the text of the link and the link itself.

When I view the source of the page I see that the classes have random names.

Does anyone have any experience scraping Google News? Has anyone here succeeded or failed?

Here is an example search: https://www.google.com/search?tbm=nws&q=Palestine&oq=Palestine

Thanks!

leemeng · January 6, 2020, 3:46am

Yeah, it is probably a kind of anti-scraping measure. You'll need to look at the source and find more general selectors. Try this:

{"_id":"goog-news-test","startUrl":["https://www.google.com/search?tbm=nws&q=iraq&oq=iraq"],"selectors":[{"id":"story-and-link wrappers","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.g a[ping]a:not(.top):not(\":contains('View all')\")","multiple":true,"delay":0},{"id":"Story","type":"SelectorText","parentSelectors":["story-and-link wrappers"],"selector":"_parent_","multiple":false,"regex":"","delay":0},{"id":"Link","type":"SelectorLink","parentSelectors":["story-and-link wrappers"],"selector":"_parent_","multiple":false,"delay":0}]}

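The "story-and-link wrappers" selector uses a couple of :not() selectors to weed out image links and "View all" links. The Story and Link selectors then read the text and the URL from each matched link via _parent_.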
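
If it helps to see the filtering logic outside the extension, here is a rough Python sketch of what that wrapper selector does: keep links that have a ping attribute and sit inside a div.g, then drop the ones with class "top" (image links) or with "View all" in their text. The class name StoryLinkParser, the function extract_story_links, and the sample HTML are all made up for illustration. Google's real markup may differ and can change at any time, so treat this as a sketch under those assumptions and not as a tested scraper.

```python
from html.parser import HTMLParser


class StoryLinkParser(HTMLParser):
    """Collect (text, href) for <a ping> links inside <div class="g">,
    skipping links with class "top" or text containing "View all".
    Hypothetical helper, mirroring the sitemap's wrapper selector."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: is it a div.g?
        self.current = None   # link currently being read
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self.div_stack.append("g" in classes)
        elif tag == "a" and any(self.div_stack):
            # a[ping]:not(.top)
            if "ping" in attrs and "top" not in classes:
                self.current = {"href": attrs.get("href"), "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "a" and self.current is not None:
            text = " ".join(self.current["text"].split())
            # :not(:contains('View all'))
            if "View all" not in text:
                self.results.append((text, self.current["href"]))
            self.current = None


# Made-up markup that only imitates the structure the selector targets.
SAMPLE_HTML = """
<div class="g">
  <a class="top" ping="/p1" href="https://example.com/a"><img src="t.png"></a>
  <a ping="/p2" href="https://example.com/a">First headline</a>
  <a ping="/p3" href="https://example.com/more">View all</a>
</div>
<div class="other"><a ping="/p4" href="https://example.com/x">Outside</a></div>
<div class="g"><a href="https://example.com/noping">No ping</a></div>
"""


def extract_story_links(html):
    parser = StoryLinkParser()
    parser.feed(html)
    return parser.results


if __name__ == "__main__":
    for text, href in extract_story_links(SAMPLE_HTML):
        print(text, href)
```

On the sample markup only the "First headline" link survives: the image link, the "View all" link, the link outside div.g, and the link without a ping attribute are all filtered out.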