Pagination works - data is not scraped

First of all, I'm new to this. Bear with me if I made some dumb configuration mistake in the Selectors of the Sitemaps I'll share.

My issue is the following:

  • I can scrape the data of a single page.
    I have multiple pages to scrape, though, and manually modifying the scraping job over 400 times when it can be automated is not ideal;
  • I can configure the scraper to go through the pages (by finding the "Next" button).
    Once I do that, the scraper no longer scrapes the data I care about.
    The only data that was scraped came from the "last page", which is not a real page: when you click "Next" on the actual last page, the site understands that the user is trying to move to a page that does not exist, so it shows what is pretty much a "fancy" 404 page.

There is clearly something I'm doing wrong.
I know there are many other Topics related to this. Going through them improved my knowledge of this awesome tool, but I was still not able to make it work, hence I need some help.

The URL of the site I want to scrape from is the following:
Light Novel Pub
and specifically this novel:
Tensei Shitara Slime Datta Ken

Here is what I tried:
Sitemap without pagination (the text I want to scrape is scraped):
{ "rootSelector": { "id": "_root", "uuid": "0" }, "_id": "without-pagination", "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ], "selectors": [ { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" }, { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1" ], "uuid": "2" }, { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "2" ], "textmanipulation": { "removeHtml": true }, "uuid": "3" } ], "sitemapSpecificationVersion": 1 }

Sitemap with pagination (the text I want to scrape is not scraped, although the tool does go through every single page I care about):
{ "rootSelector": { "id": "_root", "uuid": "0" }, "_id": "chapters-pagination", "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ], "selectors": [ { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" }, { "id": "pagin", "selector": "a.button.nextchap", "type": "SelectorLink", "multiple": true, "extractAttribute": "href", "parentSelectors": [ "1", "4" ], "uuid": "4" }, { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1", "4" ], "uuid": "5" }, { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "5" ], "textmanipulation": { "removeHtml": true }, "uuid": "6" } ], "sitemapSpecificationVersion": 1 }
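To make the selector hierarchy in the pagination sitemap easier to read, here is a minimal Python sketch that lists which selectors sit under which parent. It uses a trimmed copy of the "chapters-pagination" sitemap above (only the id, type, uuid and parentSelectors fields). The function name children_by_parent is made up for illustration and is not part of the tool; this only helps with inspecting the JSON, it does not diagnose the problem by itself.

```python
import json

# Trimmed copy of the "chapters-pagination" sitemap (fields not needed here are omitted).
SITEMAP = json.loads("""
{
  "rootSelector": {"id": "_root", "uuid": "0"},
  "selectors": [
    {"id": "chapter0", "type": "SelectorLink", "parentSelectors": ["0"], "uuid": "1"},
    {"id": "pagin", "type": "SelectorLink", "parentSelectors": ["1", "4"], "uuid": "4"},
    {"id": "chapter-element", "type": "SelectorElement", "parentSelectors": ["1", "4"], "uuid": "5"},
    {"id": "chapter", "type": "SelectorText", "parentSelectors": ["5"], "uuid": "6"}
  ]
}
""")


def children_by_parent(sitemap):
    """Map each parent selector id to the ids of the selectors that list it as a parent."""
    # In this export, parentSelectors holds uuids, so translate them back to ids first.
    names = {sitemap["rootSelector"]["uuid"]: sitemap["rootSelector"]["id"]}
    names.update({sel["uuid"]: sel["id"] for sel in sitemap["selectors"]})
    tree = {}
    for sel in sitemap["selectors"]:
        for parent_uuid in sel["parentSelectors"]:
            tree.setdefault(names[parent_uuid], []).append(sel["id"])
    return tree


TREE = children_by_parent(SITEMAP)
for parent, kids in TREE.items():
    print(f"{parent} -> {', '.join(kids)}")
```

Reading the printed lines shows, for example, that "pagin" lists itself as one of its own parents, which is how the "Next" link gets followed from page to page, and that "chapter-element" is reachable from both "chapter0" and "pagin".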

If you are able to help me I would be grateful (and, if you can, please also explain where I went wrong so I can learn :sweat_smile:).

Hey!

If I understood correctly, you are talking about configuring a range of pages for the Sitemap.

I did not actually try this because there are some pages that do not follow a simple pattern.

So, to be clear, the first page I care about has this pattern:
https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/chapter-0
the second one is like this:
https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/chapter-1-14072110

I thought this would be an issue, not sure if it actually is though.

I will give it a try!
Thank you :smiley:

Wow, novel scraping; first time I've seen this use case :sweat_smile:
Anyway, I couldn't get either sitemap to load. You might also want to look into something like HTTrack, which might work better for you.

Thank you, the solution you gave me worked.
For some reason the scraping order is strange, though.
The scraping job itself starts from the end of the range I configured, and in the scraped data the order seems to be random.
Nothing too problematic, as I will be able to handle it afterwards.

Is this behavior normal though?
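
For putting the rows back in order afterwards, something like the following minimal Python sketch might be enough. It assumes each exported row carries the chapter URL; the "url" and "chapter" keys and the placeholder texts are hypothetical, and only the two URLs are taken from earlier in this thread. The chapter number is read from the "/chapter-N" part of the URL, so the extra trailing number on some chapters does not matter.

```python
import re

# Matches the number right after "/chapter-", e.g. 1 in ".../chapter-1-14072110".
CHAPTER_RE = re.compile(r"/chapter-(\d+)")


def chapter_number(url):
    """Return the chapter number embedded in a chapter URL, or -1 if none is found."""
    match = CHAPTER_RE.search(url)
    return int(match.group(1)) if match else -1


BASE = "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109"

# Hypothetical rows in the scrambled order the scraper produced.
rows = [
    {"url": f"{BASE}/chapter-1-14072110", "chapter": "placeholder text of chapter 1"},
    {"url": f"{BASE}/chapter-0", "chapter": "placeholder text of chapter 0"},
]

# Sort by the numeric chapter value, not by the URL string.
ordered = sorted(rows, key=lambda row: chapter_number(row["url"]))

for row in ordered:
    print(chapter_number(row["url"]), row["url"])
```

Sorting on the integer rather than on the URL text avoids chapter 10 landing before chapter 2.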

Hey!

Thank you for the tip, I will look that tool up too :slight_smile:

I'm not sure why you were not able to load the sitemaps, though. I used 'export sitemap' and copied everything from there.