Scraping Google Translate

Instead of going through the hassle of the Google Translate API, I was thinking of just scraping Google Translate's page (example: https://translate.google.ca/?hl=en&tab=wT#en/el/hello), but the https://translate.google.ca/robots.txt has the following:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&sl=
Disallow: /?hl=*&sl=*&
Allow: /?hl=*&tl=
Disallow: /?hl=*&tl=*&
Allow: /?hl=*&sl=*&tl=
Disallow: /?hl=*&sl=*&tl=*&
...

which makes me think that I am not allowed to scrape what I want. Am I right?
If so, and if I DO choose to scrape it, will Google block my IP?
If I'm reading this wrong and I AM allowed to scrape it, should I put in a delay between requests, or can I just shoot thousands of requests a second at it? What's best practice? I'm new to this.

Nicholas
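
On the delay question, here is a minimal Python sketch of a throttle that enforces a minimum gap between requests. The `Throttle` class, the 0.2-second interval, and the example URLs are all hypothetical choices for illustration; nothing in this thread gives a rate that Google considers acceptable, so treat the numbers as placeholders.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    # Hypothetical URLs; the fetch itself is left out on purpose.
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    throttle = Throttle(min_interval=0.2)
    for url in urls:
        throttle.wait()
        print("would fetch", url)
```

Calling `wait()` before every request keeps the loop from ever sending faster than one request per `min_interval`, however quickly the rest of the code runs.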

robots.txt files tell search engines whether or not to index a site's contents. They only tell robots what is allowed or disallowed. That is their only purpose.

I understand that, but in the Robots.txt file, it lists: Disallow: /?hl=*&. The URL that I want to scrape is https://translate.google.ca/?hl=en&tab=wT#en/el/hello. Does that not mean that I am not allowed to scrape that page?
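
To make that reading concrete, here is a rough Python sketch of how a crawler that chooses to honor the file might evaluate those rules. It assumes the matching behavior described in Google's robots.txt documentation and RFC 9309 as I understand it: `*` matches any run of characters, the longest matching pattern wins, and `Allow` wins a tie. The helper names and the `RULES` list (transcribed from the `User-agent: *` group quoted in the question, with the `*` wildcards as in `Disallow: /?hl=*&`) are mine, not part of any library.

```python
import re


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex anchored at the start of the path."""
    anchored = pattern.endswith("$")      # a trailing '$' pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern) pairs from one user-agent group."""
    best_len, allowed = -1, True          # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:                   # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("Disallow", "/?"),
    ("Allow", "/?hl="),
    ("Disallow", "/?hl=*&"),
    ("Allow", "/?hl=*&sl="),
    ("Disallow", "/?hl=*&sl=*&"),
    ("Allow", "/?hl=*&tl="),
    ("Disallow", "/?hl=*&tl=*&"),
    ("Allow", "/?hl=*&sl=*&tl="),
    ("Disallow", "/?hl=*&sl=*&tl=*&"),
]

# The fragment (#en/el/hello) is not part of the path, so it is left out here.
print(is_allowed(RULES, "/?hl=en&tab=wT"))        # False
print(is_allowed(RULES, "/?hl=en"))               # True
print(is_allowed(RULES, "/?hl=en&sl=en&tl=el"))   # True
```

Under these assumptions, `/?hl=en&tab=wT` is caught by `Disallow: /?hl=*&`, which is longer than the `Allow: /?hl=` rule that also matches, so the page in question is disallowed for generic robots.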

Hi!

WebScraper works as if a live person were opening a page and reading it. Once the scrape is complete, it just closes the tab it opened for the scrape.

My question is about general web scraping, not about this software specifically. So far, the answers to my question have not been helpful.

Like I told you, robots.txt is not related to scraping. It's related to how a website gets indexed by search engines. As for scraping Google Translate: in my opinion it's not something you can scrape. You have to enter text on the input side, and it is translated "live" into the other selected language. Like iconoclast said, scrapers are made for recording data that is visible on a web page, by working through URLs and pagination. I'm not sure scraping tools will do what you expect.

Another point: if you try to scrape Google Translate, it's possible that Google will block your IP. I'm pretty sure Google protects its applications and provides the API as the only way in, exposing just the data and information it chooses to allow. According to some cases I've found, this is the message you get once you've been banned:

"We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.

We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.

If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center.

If your entire network is affected, more information is available in the Google Web Search Help Center.

We apologize for the inconvenience, and hope we'll see you again on Google."

Now that is the answer I was looking for. Thank you very much, 8ternity.

Regarding how to use Google Translate plus scraping, the format is https://translate.google.ca/?hl=en&tab=wT#en/el/hello for translating "hello" from English to Greek. I use HTMLAgilityPack and C# for scraping, and you can load pre-translated Google Translate pages by modifying the URL and then grab certain sections of the HTML using XPath through HTMLAgilityPack. It works really well. I've gotten 300MB of news articles (25,000-ish articles in all) using this method so far. It's very good and totally free.

That's an interesting idea. I did give it a try. It turns out the HTML response that Google sends is mostly JavaScript and does not contain any of the "translated from" or "translated to" text that normally appears in a web browser. I'm no expert in JavaScript, but from what I can tell looking at the HTML response, all that JavaScript is supposed to execute in a web browser in real time: it fetches the translated text and then presents it in a frame in the browser. HTMLAgilityPack only grabs the JavaScript. How did you manage to get it to execute and produce your translated text?
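
One detail that fits this observation: everything after the `#` in that URL is a fragment, and HTTP clients do not send the fragment to the server. A small Python sketch using the standard library's `urllib.parse` shows which part of the example URL would actually reach Google; `request_target` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path plus query that a client puts on the request line.

    The fragment is dropped because it never leaves the client.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


url = "https://translate.google.ca/?hl=en&tab=wT#en/el/hello"

print(urlsplit(url).fragment)   # en/el/hello  (stays in the browser)
print(request_target(url))      # /?hl=en&tab=wT  (what the server sees)
```

So a plain HTTP fetch of that URL never tells the server which languages or which word were requested. That would be consistent with getting back only a generic page of JavaScript, which presumably reads the fragment in the browser and fetches the translation afterwards.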

You know, it turns out you're right. I never actually ran this code against Google Translate for fear of getting blocked, but I did a couple of times today and it just brings back the JavaScript like you said. I'm curious whether there is a way to scrape it and actually get the values with something like HTMLAgilityPack (or some other kind of library), and I'm sure there are web scraping tools that do this. I've managed to scrape news sites and Wikipedia without a problem.