Pagination works but only last page content is saved

daniel_w · July 20, 2018, 1:11am

The scraper goes through all the pages fine, but only the last page's content shows up in the data (I get one row of output).

Url: http://principlesofaccounting.com/chapter-1/

Sitemap:
{"_id":"principlesofaccounting","startUrl":["https://www.principlesofaccounting.com/chapter-1/"],"selectors":[{"id":"text","type":"SelectorHTML","selector":"article.single-page-content","parentSelectors":["_root","paging"],"multiple":false,"regex":"","delay":0},{"id":"paging","type":"SelectorLink","selector":"li.next a","parentSelectors":["_root","paging"],"multiple":false,"delay":0},{"id":"title","type":"SelectorText","selector":"h1","parentSelectors":["_root","paging"],"multiple":false,"regex":"","delay":0}]}

Alternatively, this startUrl can be used (so it will only go through the last two pages): https://www.principlesofaccounting.com/chapter-24/compound-interest/

iconoclast · July 20, 2018, 9:14am

Hi!

You have to tick "multiple" on your pagination selector.

daniel_w · July 20, 2018, 12:37pm

What it looks like on my end:

KristapsWS · July 20, 2018, 1:47pm

You have to use an Element selector alongside the recursive Link selector if you are scraping only text. Here is the updated sitemap:

{"_id":"principlesofaccounting2","startUrl":["https://www.principlesofaccounting.com/chapter-1/"],"selectors":[{"id":"text","type":"SelectorHTML","parentSelectors":["element"],"selector":"article.single-page-content","multiple":false,"regex":"","delay":0},{"id":"title","type":"SelectorText","parentSelectors":["element"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"paging","type":"SelectorLink","parentSelectors":["_root","paging"],"selector":"li.next a","multiple":true,"delay":0},{"id":"element","type":"SelectorElement","parentSelectors":["_root","paging"],"selector":"body","multiple":true,"delay":0}]}

daniel_w · July 20, 2018, 9:41pm

Excellent, thank you! Could you explain to me why this is, though, or provide me with a link that explains it further? Does that mean if I were to include an image it would work just fine?

Is there a way to scrape elements on a page based on some condition? For example, every time the <h1> on the page includes "Chapter", scrape a link with a specific class/ID from the same page.

Also, is it normal that the order of rows appears to be random after scraping?

mazvmiguel · October 9, 2018, 9:20pm

Thank you very much. Recursion is powerful.