Amazon Review Pagination

Hello

I am having trouble with the code below: it only scrapes the first page of the Amazon reviews. It was working before.

Can you please help?

{
    "_id": "amazon_reviews",
    "startUrl": [
      "https://www.amazon.com/Screen-Protector-SPARIN-Tempered-Glass/product-reviews/B013JZCAZK/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
    ],
    "selectors": [
      {
        "id": "review",
        "type": "SelectorElement",
        "parentSelectors": [
          "_root",
          "next"
        ],
        "selector": "div.a-section.review",
        "multiple": true,
        "delay": 0
      },
      {
        "id": "author",
        "type": "SelectorText",
        "parentSelectors": [
          "review"
        ],
        "selector": "span.a-profile-name",
        "multiple": false,
        "regex": "",
        "delay": 0
      },
      {
        "id": "title",
        "type": "SelectorText",
        "parentSelectors": [
          "review"
        ],
        "selector": "a.a-size-base.review-title",
        "multiple": false,
        "regex": "",
        "delay": 0
      },
      {
        "id": "date",
        "type": "SelectorText",
        "parentSelectors": [
          "review"
        ],
        "selector": "span.a-size-base.a-color-secondary",
        "multiple": false,
        "regex": "",
        "delay": 0
      },
      {
        "id": "content",
        "type": "SelectorText",
        "parentSelectors": [
          "review"
        ],
        "selector": "div.a-row.review-data span.a-size-base",
        "multiple": false,
        "regex": "",
        "delay": 0
      },
      {
        "id": "rating",
        "type": "SelectorText",
        "parentSelectors": [
          "review"
        ],
        "selector": "span.a-icon-alt",
        "multiple": false,
        "regex": "",
        "delay": 0
      },
      {
        "id": "next",
        "type": "SelectorLink",
        "parentSelectors": [
          "_root",
          "next"
        ],
        "selector": "li.a-last a",
        "multiple": false,
        "delay": 0
      }
    ]
  }

The pagination for this sitemap no longer works in 2020 due to a change in Amazon's pagination links. They switched from plain HTML links to JavaScript links, so a selector of Type: Link will no longer work.

I have forked the project on GitHub and I am working on an update. Will post here when ready.

Ah, I thought something was fishy. Thanks a lot!

Hi @leemeng - have you had any luck with the update?

{"_id":"amazon_reviews-2020-limited-pagi","startUrl":["https://www.amazon.com/Ovente-Dual-Sided-Magnification-Electrical-MPWD3185BZ1X7X/product-reviews/B074GCRS9D","https://www.amazon.com/Columbia-Redmond-Waterproof-Cordovan-Regular/product-reviews/B07JH35P96","https://www.amazon.com/Merrell-Mens-Moab-Waterproof-Hiking/product-reviews/B01HF9ZN7I","https://www.amazon.com/Screen-Protector-SPARIN-Tempered-Glass/product-reviews/B013JZCAZK"],"selectors":[{"id":"Product name","type":"SelectorText","parentSelectors":["_root"],"selector":"div[class*='product-title']","multiple":false,"regex":"","delay":0},{"id":"Review wrappers","type":"SelectorElement","parentSelectors":["_root","Click Next"],"selector":"div.a-section.review","multiple":true,"delay":0},{"id":"author","type":"SelectorText","parentSelectors":["Review wrappers"],"selector":"span.a-profile-name","multiple":false,"regex":"","delay":0},{"id":"title","type":"SelectorText","parentSelectors":["Review wrappers"],"selector":"a.a-size-base.review-title","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["Review wrappers"],"selector":"span.a-size-base.a-color-secondary","multiple":false,"regex":"","delay":0},{"id":"content","type":"SelectorText","parentSelectors":["Review wrappers"],"selector":"div.a-row.review-data span.a-size-base","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["Review wrappers"],"selector":"span.a-icon-alt","multiple":false,"regex":"","delay":0},{"id":"Click Next","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.review-views","multiple":false,"delay":"4500","clickElementSelector":"div.a-col-left:not(\":contains('Showing 51-60 of')\") ul .a-last a","clickType":"clickMore","discardInitialElements":"discard","clickElementUniquenessType":"uniqueText"}]}

Amazon US reviews scraper updated for 2020. This sitemap extracts review listings for a single product on Amazon.com using the Web Scraper Chrome Extension. The sitemap handles pagination and now includes the ability to limit the number of pages. Please read the instructions and changelog in the comments section below.

INSTRUCTIONS

This sitemap will extract review listings for single products on Amazon US. I have added a pagination limiter which makes it stop at page 6.

The limiter works by searching for pagination text which looks like “Showing 1-10 of 1,766 reviews”.

In this example, the paginator will click Next until it finds “Showing 51-60 of”, which indicates page 6 (the Amazon US site has 10 reviews per page). You need to do some testing and perhaps a bit of math to figure out what text will appear on the page you want to stop at.
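The "bit of math" can be sketched in a few lines of Python. This is only a rough helper, not part of the sitemap: the function name stop_marker is made up for illustration, it assumes 10 reviews per page as on the US site, and it assumes numbers above 999 are shown with a thousands separator (as the total in “1,766 reviews” is), which I have not verified for the range itself.

```python
def stop_marker(stop_page: int, per_page: int = 10) -> str:
    """Return the pagination text expected on the page where clicking should stop.

    Hypothetical helper: assumes every page before the stop page is full.
    """
    if stop_page < 1 or per_page < 1:
        raise ValueError("stop_page and per_page must be positive")
    first = (stop_page - 1) * per_page + 1  # first review shown on that page
    last = stop_page * per_page             # last review shown on that page
    # Assumption: thousands separators are used, e.g. "1,001-1,010".
    return f"Showing {first:,}-{last:,} of"


if __name__ == "__main__":
    for page in (2, 6, 10):
        print(page, "->", stop_marker(page))
```

For page 6 this gives the same text used in the sitemap. If the product has fewer reviews than the end of the range, the text on that page will differ, but then it is the last page anyway.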

This limiter can also be removed by deleting the :not() part of the click selector, leaving only:

div.a-col-left ul .a-last a
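If you switch between the limited and unlimited versions often, the click selector can be built from a small Python sketch like the one below. The function name click_next_selector is hypothetical; the selector pieces are copied from the sitemap above. json.dumps is used only to show the escaped form that goes into the sitemap JSON.

```python
import json
from typing import Optional

BASE = "div.a-col-left"     # pagination container used in the sitemap
NEXT_LINK = "ul .a-last a"  # the Next link inside it


def click_next_selector(marker: Optional[str] = None) -> str:
    """Build the clickElementSelector, with or without the page limiter.

    marker is the pagination text of the page to stop at, e.g.
    "Showing 51-60 of". With marker=None the limiter is left out.
    """
    if marker is None:
        return f"{BASE} {NEXT_LINK}"
    if "'" in marker or '"' in marker:
        raise ValueError("marker must not contain quote characters")
    # Same shape as the sitemap: :not(":contains('...')") on the container.
    return f"{BASE}:not(\":contains('{marker}')\") {NEXT_LINK}"


if __name__ == "__main__":
    print(click_next_selector())                               # no limiter
    print(json.dumps(click_next_selector("Showing 51-60 of")))  # escaped for JSON
```

Paste the JSON-escaped string into the clickElementSelector field of the Click Next selector.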

I have tested this sitemap on four different URLs, which are included in the startUrl section.

CHANGELOG

This sitemap was forked from scrapehero’s sitemap from Jan 2019. Pagination for that sitemap no longer works so I have improved it for 2020.

For pagination, Amazon has switched from HTML links to JS links, so Type: Link no longer works here. These are the main changes from scrapehero’s sitemap:

  • Changed paginator to Type: Element Click, Click Type: Click More.
  • The paginator no longer needs to be child of itself (recursive). The Click More option handles this.
  • Added a method to limit the number of pages. It is based on the :not() CSS selector. This limiter can easily be removed (see Instructions section).
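For anyone with other sitemaps built on the old recursive Link paginator, the first two changes above can be sketched as a small Python conversion. This is an illustration only: the function name to_click_pagination is made up, the field values are copied from the 2020 sitemap in this thread, and the page limiter is left out of the click selector.

```python
import copy


def to_click_pagination(sitemap: dict, old_id: str = "next",
                        new_id: str = "Click Next") -> dict:
    """Return a copy of sitemap with the Link paginator replaced by Element Click."""
    result = copy.deepcopy(sitemap)
    converted = []
    for sel in result["selectors"]:
        if sel["id"] == old_id:
            # Replace the recursive Link selector with a non-recursive click selector.
            converted.append({
                "id": new_id,
                "type": "SelectorElementClick",
                "parentSelectors": ["_root"],  # no longer a child of itself
                "selector": "div.review-views",
                "multiple": False,
                "delay": "4500",
                "clickElementSelector": "div.a-col-left ul .a-last a",
                "clickType": "clickMore",
                "discardInitialElements": "discard",
                "clickElementUniquenessType": "uniqueText",
            })
        else:
            # Point children of the old paginator at the new one.
            sel["parentSelectors"] = [new_id if p == old_id else p
                                      for p in sel["parentSelectors"]]
            converted.append(sel)
    result["selectors"] = converted
    return result


# Trimmed-down version of the old sitemap, for demonstration only.
old_sitemap = {
    "_id": "amazon_reviews",
    "selectors": [
        {"id": "review", "type": "SelectorElement",
         "parentSelectors": ["_root", "next"],
         "selector": "div.a-section.review", "multiple": True, "delay": 0},
        {"id": "next", "type": "SelectorLink",
         "parentSelectors": ["_root", "next"],
         "selector": "li.a-last a", "multiple": False, "delay": 0},
    ],
}
new_sitemap = to_click_pagination(old_sitemap)
```

The input dictionary is not modified, so the old and new versions can be compared before importing the result into the extension.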

This sitemap is also on Github at: https://gist.github.com/LeeMeng2020/6ac97d21aa41841ef2033f0467f3c316

Great - thanks @leemeng!! Will give this a try today

Hi @leemeng - I think Amazon may have done something again to their site. The scraping has stopped working for me :-/

I just wanted to check if you see the same?

Thanks!

I ran the same sitemap on June 4 and got all the expected results. Not sure what the issue is on your end.

Ah, I think it was my URL. Thank you!

@leemeng thanks for the script, it works perfectly.

Quick question: is there a way to stop after the first n pages?

For some products there are thousands of reviews, which takes forever and isn't needed in our research process. We might just need the first 10 pages.

Any way to do that?

Thanks!