Pagination works - data is not scraped

luccoli · August 11, 2023, 12:24pm

First of all, I'm new to this. Bear with me if I did some dumb configuration in the Selectors of the Sitemaps I'll share.

My issue is the following:

I can scrape the data of a single page.
I have multiple pages to scrape, though, and manually modifying the scraping job over 400 times when it can be automated is not ideal.
I can configure the scraper to go through the pages (by finding the Next button), but once I do that, the scraper no longer scrapes the data I care about.
The only data that was scraped came from the "last page", which is not a real page. When I click "Next" on the actual last page, the site understands that the user is trying to move to a page that does not exist, so it shows what is pretty much a "fancy" 404 page.

There is clearly something I'm doing wrong.
I know there are many other topics related to this. Going through them has increased my knowledge of this awesome tool, but I was still not able to make it work, hence I need some help.

The URL of the site I want to scrape from is the following:
Light Novel Pub
and specifically this novel:
Tensei Shitara Slime Datta Ken

Here is what I tried:
Sitemap without pagination (text I want to scrape is scraped):
{ "rootSelector": { "id": "_root", "uuid": "0" }, "_id": "without-pagination", "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ], "selectors": [ { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" }, { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1" ], "uuid": "2" }, { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "2" ], "textmanipulation": { "removeHtml": true }, "uuid": "3" } ], "sitemapSpecificationVersion": 1 }

Sitemap with pagination (text I want to scrape is not scraped. The tool goes through every single page I care about though):
{ "rootSelector": { "id": "_root", "uuid": "0" }, "_id": "chapters-pagination", "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ], "selectors": [ { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" }, { "id": "pagin", "selector": "a.button.nextchap", "type": "SelectorLink", "multiple": true, "extractAttribute": "href", "parentSelectors": [ "1", "4" ], "uuid": "4" }, { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1", "4" ], "uuid": "5" }, { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "5" ], "textmanipulation": { "removeHtml": true }, "uuid": "6" } ], "sitemapSpecificationVersion": 1 }

If you are able to help me I would be grateful (and if you can, please also explain where I went wrong so I can learn).

luccoli · August 11, 2023, 1:15pm

Hey!

If I understood correctly you are talking about the fact that I can configure the range of pages for the Sitemap.

I did not actually try this because there are some pages that do not follow a simple pattern.

So, to be clear, the first page I care about has this pattern:
https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/chapter-0
the second one is like this:
https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/chapter-1-14072110

I thought this would be an issue, not sure if it actually is though.
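To make the concern concrete, here is a minimal sketch of how a bracketed page range such as `[1-3]` in a start URL expands into individual URLs, with the irregular first chapter listed separately. The `expand_range` helper is hypothetical and written only for illustration (it is not the tool's own code), and it assumes that later chapters reuse the `-14072110` suffix seen on chapter 1, which this thread does not confirm.

```python
import re


def expand_range(url_pattern):
    """Expand the first [start-end] range in a URL into a list of URLs.

    Hypothetical helper for illustration only; it mimics the bracketed
    range notation, not any tool's actual implementation.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", url_pattern)
    if match is None:
        # No range present: the pattern is already a single URL.
        return [url_pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix = url_pattern[:match.start()]
    suffix = url_pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


base = "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/"

# chapter-0 has no numeric suffix, so it cannot be covered by the range
# and is added as its own start URL.
# ASSUMPTION: chapters 2 and 3 share the "-14072110" suffix of chapter 1.
urls = [base + "chapter-0"] + expand_range(base + "chapter-[1-3]-14072110")

for url in urls:
    print(url)
```

If the assumption about the suffix holds, the odd first page is not a blocker: it just needs to be a second start URL next to the ranged one.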

I will give it a try!
Thank you

leemeng · August 11, 2023, 3:23pm

Wow, novel scraping; first time I've seen this use case.
Anyway, I couldn't get either sitemap to load. You might also want to look into something like HTTrack, which might work better for you.

luccoli · August 14, 2023, 8:29am

Thank you, the solution you gave me worked.
For some reason the scraping order is strange.
The actual scraping job starts from the end of the range I configured. In the actual scraped data the order seems to be random.
Nothing too problematic though as I will be able to handle it afterwards.

Is this behavior normal though?
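For handling the order afterwards, a minimal sketch like the one below could sort the exported rows by the chapter number found in each row's URL. It assumes the export has a `web-scraper-start-url` column holding the chapter URL and a `chapter` column holding the text; the sample rows are made up for illustration, and `chapter_number` is a hypothetical helper.

```python
import re


def chapter_number(url):
    """Return the chapter number embedded in a chapter URL, or -1 if absent."""
    match = re.search(r"/chapter-(\d+)", url)
    return int(match.group(1)) if match else -1


base = "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-wn-109/"

# Made-up sample rows standing in for the exported data (ASSUMPTION:
# the column names match the real export).
rows = [
    {"web-scraper-start-url": base + "chapter-2-14072110", "chapter": "text C"},
    {"web-scraper-start-url": base + "chapter-0", "chapter": "text A"},
    {"web-scraper-start-url": base + "chapter-1-14072110", "chapter": "text B"},
]

# sorted() is stable, so rows from the same chapter keep the relative
# order they already had in the export.
ordered = sorted(rows, key=lambda row: chapter_number(row["web-scraper-start-url"]))

for row in ordered:
    print(chapter_number(row["web-scraper-start-url"]), row["chapter"])
```

Because the sort is stable, this only fixes the order between chapters; if paragraphs inside one chapter are also shuffled, a second sort key would be needed.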

luccoli · August 14, 2023, 8:35am

Hey!

Thank you for the tip, I will look that tool up too.

I'm not sure why you were not able to load the sitemaps, though. I used 'Export Sitemap' and copied everything from there.