First of all, I'm new to this. Bear with me if I did some dumb configuration in the Selectors of the Sitemaps I'll share.
My issue is the following:
- I can scrape the data of a single page. I have multiple pages to scrape, though, and manually modifying the scraping job over 400 times when it could be automated is not ideal.
- I can configure the scraper to go through the pages (by finding the "Next" button), but once that is done the scraper does not scrape the data I care about. The only data that was scraped came from the "last page", which is not a real page: when you click "Next" on the actual last page, the site understands that the user is trying to move to a page that does not exist, so it shows what is pretty much a "fancy" 404 page.
There is clearly something I'm doing wrong.
I know there are many other topics related to this. Going through them improved my understanding of this awesome tool, but I still was not able to make it work, hence I need some help.
The URL of the site I want to scrape from is the following:
Light Novel Pub
and specifically this novel:
Tensei Shitara Slime Datta Ken
Here is what I tried:
Sitemap without pagination (text I want to scrape is scraped):
{
  "rootSelector": { "id": "_root", "uuid": "0" },
  "_id": "without-pagination",
  "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ],
  "selectors": [
    { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" },
    { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1" ], "uuid": "2" },
    { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "2" ], "textmanipulation": { "removeHtml": true }, "uuid": "3" }
  ],
  "sitemapSpecificationVersion": 1
}
Sitemap with pagination (the text I want to scrape is not scraped, although the tool does go through every single page I care about):
{
  "rootSelector": { "id": "_root", "uuid": "0" },
  "_id": "chapters-pagination",
  "startUrls": [ "https://www.lightnovelpub.com/novel/tensei-shitara-slime-datta-ken-ln-14072110" ],
  "selectors": [
    { "id": "chapter0", "selector": "a#readchapterbtn.button", "type": "SelectorLink", "extractAttribute": "href", "parentSelectors": [ "0" ], "uuid": "1" },
    { "id": "pagin", "selector": "a.button.nextchap", "type": "SelectorLink", "multiple": true, "extractAttribute": "href", "parentSelectors": [ "1", "4" ], "uuid": "4" },
    { "id": "chapter-element", "selector": "div.chapter-content", "type": "SelectorElement", "parentSelectors": [ "1", "4" ], "uuid": "5" },
    { "id": "chapter", "selector": "p", "type": "SelectorText", "multiple": true, "parentSelectors": [ "5" ], "textmanipulation": { "removeHtml": true }, "uuid": "6" }
  ],
  "sitemapSpecificationVersion": 1
}
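In case it helps anyone reading the pagination sitemap, here is a rough Python sketch that prints which selectors sit under which parent. It is only an inspection aid, not part of the Web Scraper tool. It assumes that each entry in "parentSelectors" refers to a selector's "uuid" (that is how it looks in the JSON above), and the helper name `children_by_parent` is made up for this example. The embedded JSON is an abbreviated copy of the sitemap with only the fields needed for the hierarchy.

```python
import json
from collections import defaultdict

# Abbreviated copy of the "chapters-pagination" sitemap:
# only the id, uuid and parentSelectors fields are kept.
SITEMAP = """
{
  "rootSelector": {"id": "_root", "uuid": "0"},
  "selectors": [
    {"id": "chapter0", "uuid": "1", "parentSelectors": ["0"]},
    {"id": "pagin", "uuid": "4", "parentSelectors": ["1", "4"]},
    {"id": "chapter-element", "uuid": "5", "parentSelectors": ["1", "4"]},
    {"id": "chapter", "uuid": "6", "parentSelectors": ["5"]}
  ]
}
"""


def children_by_parent(sitemap):
    """Map each parent selector id to the ids of the selectors nested under it.

    Assumption: parentSelectors entries are uuids, as in the pasted sitemap.
    """
    # Translate uuids back to readable ids, including the root selector.
    names = {sitemap["rootSelector"]["uuid"]: sitemap["rootSelector"]["id"]}
    for sel in sitemap["selectors"]:
        names[sel["uuid"]] = sel["id"]

    tree = defaultdict(list)
    for sel in sitemap["selectors"]:
        for parent_uuid in sel["parentSelectors"]:
            tree[names[parent_uuid]].append(sel["id"])
    return dict(tree)


if __name__ == "__main__":
    for parent, kids in children_by_parent(json.loads(SITEMAP)).items():
        print(f"{parent} -> {', '.join(kids)}")
```

Read this way, "pagin" lists itself as one of its own parents, and "chapter-element" sits under both "chapter0" and "pagin".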
If you are able to help me, I would be grateful. If you can, please also explain where I went wrong so I can learn.