Extracting data from websites in a Google search

Hello, I'm trying to scrape data from the websites returned by a Google search.

So for each link in the search results it would need to follow the link and extract data from the actual website. It would also need to handle clicking Next in the Google search results.

I know how to do this for the first page, and I know how to extract the links across multiple pages, but I can't figure out how to both follow the links and page through multiple results pages. I know I'm missing something simple.

"Hello" would be a nice word to start your conversation ...

Could you give us an example of your sitemap so we can tell you more?

Thanks

For instance, this pulls the titles of all TechCrunch articles about the iPhone using Google search results. It only works for the first page of results. How would I get it to click the next button and keep going? Thanks!

{
  "_id": "test_google",
  "startUrl": ["https://www.google.com/search?&q=site%3Atechcrunch.com+iphone"],
  "selectors": [
    {
      "id": "click_link",
      "type": "SelectorLink",
      "selector": "div.srg h3.r a",
      "parentSelectors": ["_root"],
      "multiple": true,
      "delay": 0
    },
    {
      "id": "title",
      "type": "SelectorText",
      "selector": "h1.article__title",
      "parentSelectors": ["click_link"],
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}

Hi,
try changing the selector for your pagination like this:

{
  "_id": "test",
  "startUrl": ["https://www.google.com/search?q=site:about.me+I&ei=GGbfWt7RG4Od_QbG65OoDw&start=0&sa=N&biw=1280&bih=520"],
  "selectors": [
    {
      "id": "urls",
      "type": "SelectorText",
      "selector": "cite.iUh30",
      "parentSelectors": ["_root", "pagination"],
      "multiple": true,
      "regex": "",
      "delay": 0
    },
    {
      "id": "pagination",
      "type": "SelectorLink",
      "selector": "a.pn",
      "parentSelectors": ["_root", "pagination"],
      "multiple": true,
      "delay": 0
    }
  ]
}
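The same pagination trick can also be merged into the techcrunch sitemap posted earlier, so one sitemap both pages through the results and follows each article link. A sketch under the assumption that Google's markup still matches those selectors; the `_id` is made up, and the pagination selector `a.pn` is taken from the sitemap above:

```json
{
  "_id": "test_google_paged",
  "startUrl": ["https://www.google.com/search?&q=site%3Atechcrunch.com+iphone"],
  "selectors": [
    {
      "id": "click_link",
      "type": "SelectorLink",
      "selector": "div.srg h3.r a",
      "parentSelectors": ["_root", "pagination"],
      "multiple": true,
      "delay": 0
    },
    {
      "id": "pagination",
      "type": "SelectorLink",
      "selector": "a.pn",
      "parentSelectors": ["_root", "pagination"],
      "multiple": true,
      "delay": 0
    },
    {
      "id": "title",
      "type": "SelectorText",
      "selector": "h1.article__title",
      "parentSelectors": ["click_link"],
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
```

The key idea is that both `click_link` and `pagination` list `["_root", "pagination"]` as parents, so they are re-applied on every results page the pagination link reaches.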

But be careful with your scraping, because Google can send you something like:

Our systems have detected unusual traffic from your computer network. This page checks that it is really you sending the requests, and not a robot.

This page appears when Google automatically detects requests coming from your computer network that appear to violate the Terms of Service. The block will end shortly after those requests stop. In the meantime, solving the CAPTCHA above will let you continue using our services.

This traffic may have been sent by a malicious application, a browser plug-in, or a script that sends automated queries. If you are using a shared network connection, ask your administrator for help; another computer using the same IP address may be responsible.

You may be prompted to enter the characters from the CAPTCHA image if you use advanced terms that robots are known to use, or if you send queries very quickly.

Thanks for your response. I see this scrapes the URLs, but how do I scrape the sites at those URLs? Can I do this all as one sitemap, or do I have to take those URLs and create a separate scraper? Another option would be to scrape the URLs, create a web page from that URL list, and then scrape that page.
Thanks!

As for Google, I'm doing a one-time scrape and not going very deep, so hopefully I won't have an issue.

Hello,
the sites listed after a Google request are constructed differently.
So you need to make a sitemap for each of them.

I exported the CSV file of links as an HTML page, then created a sitemap to scrape it and follow all the links. Not the most elegant solution, but it got the job done. Thanks for your help!
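For anyone repeating this workaround, the second stage can be a very small sitemap: point it at the generated link page and use a plain `a` Link selector to follow everything, with whatever child selectors fit the target site. A sketch only; the `startUrl`, the `_id`, and the selector ids are hypothetical, and `h1.article__title` is the TechCrunch selector from earlier in the thread, so it only applies if all the links go to the same site:

```json
{
  "_id": "link_list",
  "startUrl": ["http://localhost/links.html"],
  "selectors": [
    {
      "id": "article",
      "type": "SelectorLink",
      "selector": "a",
      "parentSelectors": ["_root"],
      "multiple": true,
      "delay": 0
    },
    {
      "id": "title",
      "type": "SelectorText",
      "selector": "h1.article__title",
      "parentSelectors": ["article"],
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
```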