Advice for scraping whole TripAdvisor reviews rather than partial reviews

Hello,
I am trying to scrape the reviews for the Morton Arboretum from TripAdvisor. In addition to the whole review, I would like to scrape the username, date of experience, bubble rating, and title. As of today, March 28, 2019, there are 868 reviews spread across 87 pages. I first tried to paginate through all the pages and then open each review to scrape the entire review, username, experience date, bubble rating, and title. The problem with this method is that, after the first page of reviews, TripAdvisor lists the username on the detailed review page as "TripAdvisor Member" instead of the actual username.

The second method I have tried is to go through the 87 pages of reviews and use an Element Click selector to expand the ‘more’ link on each review, since TripAdvisor hides the complete review behind that link. I have not been successful in expanding more than the first couple of ‘more’ links.

Thoughts?

Site URL:
https://www.tripadvisor.com/Attraction_Review-g36269-d132786-Reviews-or330-Morton_Arboretum-Lisle_DuPage_County_Illinois.html

Scrape method 1:

{"_id":"march21mortonarboretum2","startUrl":["https://www.tripadvisor.com/Attraction_Review-g36269-d132786-Reviews-Morton_Arboretum-Lisle_DuPage_County_Illinois.html"],"selectors":[{"id":"elementselector","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.ratings_and_types","multiple":true,"delay":0,"clickElementSelector":"div.mobile-more a.nav.next","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"link_review","type":"SelectorLink","parentSelectors":["elementselector"],"selector":"a.title","multiple":true,"delay":0},{"id":"wholereviewinsidewholereview","type":"SelectorText","parentSelectors":["link_review"],"selector":"span.fullText","multiple":false,"regex":"","delay":0},{"id":"username","type":"SelectorText","parentSelectors":["elementselector"],"selector":"div.info_text div:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","parentSelectors":["elementselector"],"selector":"strong","multiple":false,"regex":"","delay":0},{"id":"revieweddate","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.ratingDate","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorElementAttribute","parentSelectors":["elementselector"],"selector":"span.ui_bubble_rating","multiple":false,"extractAttribute":"class","delay":0},{"id":"contributionsnumber","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"helpfulvotes","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(4)","multiple":false,"regex":"","delay":0}]}

Scrape method 2: a shorter sitemap to test the click ‘more’ scrape functionality

{"_id":"march27morton5","startUrl":["https://www.tripadvisor.com/Attraction_Review-g36269-d132786-Reviews-Morton_Arboretum-Lisle_DuPage_County_Illinois.html"],"selectors":[{"id":"clickmore","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.rev_wrap","multiple":true,"delay":"300","clickElementSelector":"div.prw_rup p.partial_entry span.taLnk","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"review","type":"SelectorText","parentSelectors":["clickmore"],"selector":"div.entry","multiple":true,"regex":"","delay":"300"}]}

You can use the first option, provided that the first element selector matches only the review container, so that the scraper has defined boundaries for where to extract the information. That way, the scraper knows which pieces of information belong in the same row of the data.
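
To illustrate the row-boundary idea outside of Web Scraper, here is a minimal Python sketch. The markup is a hypothetical, heavily simplified stand-in for a review page (it only borrows the class names used in the sitemaps in this thread), and the helper names are made up for the example. Each field is looked up inside one container at a time, so a title can never be paired with the date of a different review.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two review containers, each holding its
# own title link and date. Real TripAdvisor pages are far more complex.
PAGE = """
<div>
  <div class="review-container">
    <a class="title">Lovely trails</a>
    <span class="ratingDate">Reviewed March 1, 2019</span>
  </div>
  <div class="review-container">
    <a class="title">Great for kids</a>
    <span class="ratingDate">Reviewed March 5, 2019</span>
  </div>
</div>
"""


def text_of(parent, css_class):
    """Return the text of the first descendant whose class attribute equals
    css_class exactly, or None if the container has no such element."""
    node = parent.find(f".//*[@class='{css_class}']")
    return node.text if node is not None else None


def scrape_rows(page):
    """Build one row per review container, searching only inside it."""
    root = ET.fromstring(page.strip())
    rows = []
    for container in root.findall(".//div[@class='review-container']"):
        rows.append({
            "link_review": text_of(container, "title"),
            "revieweddate": text_of(container, "ratingDate"),
        })
    return rows


rows = scrape_rows(PAGE)
for row in rows:
    print(row)
```

If the parent selector matched something broader than a single review, the child lookups would no longer be tied to one review, which is one way mismatched rows can arise.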

This sitemap should work:

{"_id":"march21mortonarboretum2","startUrl":["https://www.tripadvisor.com/Attraction_Review-g36269-d132786-Reviews-Morton_Arboretum-Lisle_DuPage_County_Illinois.html"],"selectors":[{"id":"elementselector","type":"SelectorElementClick","parentSelectors":["_root"],"selector":".review-container","multiple":true,"delay":"2000","clickElementSelector":"div.mobile-more a.nav.next","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"link_review","type":"SelectorLink","parentSelectors":["elementselector"],"selector":"a.title","multiple":false,"delay":0},{"id":"wholereviewinsidewholereview","type":"SelectorText","parentSelectors":["link_review"],"selector":"span.fullText","multiple":false,"regex":"","delay":0},{"id":"username","type":"SelectorText","parentSelectors":["elementselector"],"selector":"div.info_text div:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","parentSelectors":["elementselector"],"selector":"strong","multiple":false,"regex":"","delay":0},{"id":"revieweddate","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.ratingDate","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorElementAttribute","parentSelectors":["elementselector"],"selector":"span.ui_bubble_rating","multiple":false,"extractAttribute":"class","delay":0},{"id":"contributionsnumber","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"helpfulvotes","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(4)","multiple":false,"regex":"","delay":0}]}

P.S. To play it safe, add a delay of 2000 (milliseconds) to your Click selector, so that the page has time to render before the scraper extracts data.
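
If you prefer to patch an exported sitemap outside the extension, the following is a rough Python sketch of the same change. The function name `set_click_delay` and the trimmed-down `EXAMPLE` sitemap are assumptions made for illustration; only the `type` and `delay` keys are taken from the sitemaps in this thread. The delay is written as a string to match the export above.

```python
import json


def set_click_delay(sitemap_json, delay_ms=2000):
    """Return a copy of the sitemap JSON in which every Element Click
    selector has its delay set to delay_ms. Other selectors are untouched."""
    sitemap = json.loads(sitemap_json)
    for selector in sitemap.get("selectors", []):
        if selector.get("type") == "SelectorElementClick":
            # Stored as a string, matching the exported sitemap above.
            selector["delay"] = str(delay_ms)
    return json.dumps(sitemap, separators=(",", ":"))


# Hypothetical, trimmed-down stand-in for an exported sitemap.
EXAMPLE = (
    '{"_id":"example","startUrl":["https://example.com"],"selectors":['
    '{"id":"elementselector","type":"SelectorElementClick","delay":0},'
    '{"id":"username","type":"SelectorText","delay":0}]}'
)

patched = json.loads(set_click_delay(EXAMPLE))
for selector in patched["selectors"]:
    print(selector["id"], selector["delay"])
```

The patched JSON string can then be pasted back through the sitemap import dialog.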

This worked perfectly, thanks!

Hi @webber

I'm having a similar problem trying to scrape a TripAdvisor listing. The issue is that I can't get all the "more" links when scraping (even though I get all of them when I preview the data). Also, when I export as CSV, the columns don't match up.

What I need from the reviews is:

  • Title of the review
  • Rating
  • FULL description
  • Date

Site URL:

Sitemap:
{"_id":"tripadvisor","startUrl":["https://www.tripadvisor.cl/Restaurant_Review-g294305-d1059717-Reviews-China_Village-Santiago_Santiago_Metropolitan_Region.html"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":"a.pageNum:nth-of-type(n+2)","multiple":true,"delay":0},{"id":"wrappers","type":"SelectorElement","parentSelectors":["_root","pagination"],"selector":"div.is-9","multiple":true,"delay":0},{"id":"titulo","type":"SelectorText","parentSelectors":["wrappers"],"selector":"span.noQuotes","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorElementAttribute","parentSelectors":["wrappers"],"selector":"span.ui_bubble_rating","multiple":false,"extractAttribute":"class","delay":0},{"id":"descripcion","type":"SelectorText","parentSelectors":["wrappers"],"selector":"p","multiple":false,"regex":"","delay":0},{"id":"fecha","type":"SelectorText","parentSelectors":["wrappers"],"selector":"span.ratingDate","multiple":false,"regex":"","delay":0},{"id":"mostrar-mas","type":"SelectorElementClick","parentSelectors":["wrappers"],"selector":"[data-collapsed='true'] p","multiple":true,"delay":0,"clickElementSelector":"span.ulBlueLinks","clickType":"clickOnce","discardInitialElements":"discard","clickElementUniquenessType":"uniqueText"}]}

Thank you for your time!

I get "null" in the comments when using the following sitemap. Can you help me?

{"_id":"march21mortonarboretum2","startUrl":["https://www.tripadvisor.pt/Restaurant_Review-g189180-d12966802-Reviews-MUU_Steakhouse-Porto_Porto_District_Northern_Portugal.html"],"selectors":[{"id":"elementselector","type":"SelectorElementClick","parentSelectors":["_root"],"selector":".review-container","multiple":true,"delay":"2000","clickElementSelector":"div.mobile-more a.nav.next","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"link_review","type":"SelectorLink","parentSelectors":["elementselector"],"selector":"a.title","multiple":false,"delay":0},{"id":"wholereviewinsidewholereview","type":"SelectorText","parentSelectors":["link_review"],"selector":"span.fullText","multiple":false,"regex":"","delay":0},{"id":"username","type":"SelectorText","parentSelectors":["elementselector"],"selector":"div.info_text div:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","parentSelectors":["elementselector"],"selector":"strong","multiple":false,"regex":"","delay":0},{"id":"revieweddate","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.ratingDate","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorElementAttribute","parentSelectors":["elementselector"],"selector":"span.ui_bubble_rating","multiple":false,"extractAttribute":"class","delay":0},{"id":"contributionsnumber","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"helpfulvotes","type":"SelectorText","parentSelectors":["elementselector"],"selector":"span.badgetext:nth-of-type(4)","multiple":false,"regex":"","delay":0}]}

Hello,
I am trying to scrape the reviews for Chitwan National Park, Nepal. I am confused about what to put in the selector. You mentioned .review-container; what exactly does review-container mean?
Please help, I will be grateful.
Thank you.