Going Crazy! Another pagination issue

gozwald · April 27, 2018, 7:41pm

Hey Guys,

I could use some help with pagination here.

Url: http://www.companylisting.ca/CA/all/311/page1.aspx

I'm trying to get the scraper to click on the company names and then extract the details from the following page.

Afterward, it should click on the "next page" button at the end of the pagination group at the top of the page.

Then back to step one, and repeat.

With the sitemap I created, all it does is cycle from page to page; it doesn't enter any of the listings or details.

It actually works when I don't set up the pagination recursively; however, the moment I make the pagination selector a parent of itself, the problem appears.

Any help would be greatly appreciated!

Sitemap:
{"_id":"mancan","startUrl":["http://www.companylisting.ca/CA/all/311/page1.aspx"],"selectors":[{"id":"list","type":"SelectorLink","selector":"a.rui_title","parentSelectors":["_root","pagination"],"multiple":true,"delay":0},{"id":"details","type":"SelectorText","selector":"p.pl_company_info","parentSelectors":["list"],"multiple":true,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"a#ContentPlaceHolder1_TopNextPage.page-link","parentSelectors":["_root","pagination"],"multiple":true,"delay":0}]}

chefas · April 28, 2018, 3:23pm

Hi

Test this sitemap:

{"_id":"test","startUrl":["http://www.companylisting.ca/CA/all/313110/default.aspx?d=cat"],"selectors":[{"id":"list","type":"SelectorLink","selector":"a.rui_title","parentSelectors":["_root","pagination"],"multiple":true,"delay":0},{"id":"name","type":"SelectorText","selector":"h1.pl_company_name span","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"adress","type":"SelectorText","selector":"span#ContentPlaceHolder1_CI_Address","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","selector":"p#ContentPlaceHolder1_ShowCompanyPhone.pl_company_info span","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"a#ContentPlaceHolder1_TopNextPage.page-link","parentSelectors":["_root","pagination"],"multiple":true,"delay":0}]}

gozwald · April 29, 2018, 11:00pm

Thanks so much for responding!

So after tinkering a bit with your sitemap and seeing how it worked on the test page you suggested, I realized the following:

My original sitemap worked. However (and this is a big however), for some strange reason the scraper first has to cycle through all of the pagination links until it can no longer go forward, and only then begins pulling the page details, starting from the very last page and proceeding backward to page 1.

Is this a bug in the scraper, or am I still not understanding something here?

Looking forward to your answer!

chefas · April 30, 2018, 9:16am

Hello

No bug at all!

This is the normal way Web Scraper proceeds.

First it deals with pagination: it starts with page 1, then page 2, and so on until the last page.
During this first step no data is captured, and if you stop the process during this step, you will not recover any data.

When all the paging is done (and it can take a long time!), and if your sitemap is properly built, step 2 begins, and this is when the actual scraping is done. This scraping is done in reverse: first the last page, then page n-1, then n-2, and so on down to page 1.

You cannot avoid step 1. So when tuning your sitemap, it is advisable to test your scraping with only a few pages, say fewer than 5.
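
One way to picture the order described above is a toy model in Python. It assumes the scraper keeps its pending jobs on a last-in, first-out stack and that the "next page" link is the last job pushed for each list page. That is only an assumption made for illustration, not a description of Web Scraper's actual internals, and `crawl_order` is a made-up name.

```python
def crawl_order(num_pages, items_per_page):
    """Return the visit order of a toy LIFO crawler (illustrative only)."""
    visited = []
    stack = [("page", 1)]  # pending jobs; the most recently pushed job runs first
    while stack:
        job = stack.pop()
        visited.append(job)
        if job[0] == "page":
            page = job[1]
            # Push the item (detail) links found on this list page.
            for item in range(items_per_page, 0, -1):
                stack.append(("item", page, item))
            # Push the "next page" link last, so it is popped before the items.
            if page < num_pages:
                stack.append(("page", page + 1))
    return visited


order = crawl_order(num_pages=3, items_per_page=2)
for job in order:
    print(job)
```

In this model every list page is visited before any item, and the items of the last page are handled first, which matches the two steps described above.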

gozwald · April 30, 2018, 9:58am

Ok that makes perfect sense.

The natural challenge with something like this is that, given the number of pages involved here, if something goes wrong while the scraper is still working through the pagination links, (as you pointed out) all is lost.

Is there any other way it can be done? For example, if I could tell the scraper there are X pages, it wouldn't have to discover that information first.

Or am I getting greedy here?
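
A rough sketch of that idea in Python: build the list of page URLs up front and put them all in the sitemap's `startUrl` array. This assumes the later pages follow the same `pageN.aspx` pattern as page 1 (only page 1 appears in this thread) and that the page count is already known; `with_explicit_pages` is a hypothetical helper, and the `selectors` list is left empty here as a placeholder.

```python
import json


def with_explicit_pages(sitemap, url_template, last_page):
    """Return a copy of the sitemap whose startUrl lists pages 1..last_page."""
    pages = [url_template.format(n=n) for n in range(1, last_page + 1)]
    updated = dict(sitemap)  # shallow copy; the original dict is left untouched
    updated["startUrl"] = pages
    return updated


sitemap = {
    "_id": "mancan",
    "startUrl": ["http://www.companylisting.ca/CA/all/311/page1.aspx"],
    "selectors": [],  # placeholder: the real selectors would go here
}
# Assumed URL pattern, extrapolated from page1.aspx.
template = "http://www.companylisting.ca/CA/all/311/page{n}.aspx"

result = with_explicit_pages(sitemap, template, last_page=3)
print(json.dumps(result["startUrl"], indent=2))
```

With every list page given as a start URL, the recursive pagination selector would presumably no longer be needed, though that is untested here.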

chefas · April 30, 2018, 10:49am

You could post a feature request on this forum

gozwald · April 30, 2018, 11:01am

Will do. Thank you so much for all your help.

chefas · May 1, 2018, 5:28pm

you are welcome my friend

KristapsWS · May 2, 2018, 2:54pm

In this case you can swap the positions of the item and pagination selectors in your JSON and re-import your sitemap. This will make Web Scraper scrape all the available items on the current list page before going to the next page.

However, if we are talking about the Element click selector, you will have to wait for it to click through all of the pages before it starts scraping items or returns any data.
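
If you would rather not edit the JSON by hand, the swap of the item and pagination selectors could be scripted. The following is a minimal Python sketch that assumes "change places" means exchanging the two entries' positions in the `selectors` array; `swap_selectors` is a made-up helper, and the sitemap is the original one from this thread with the extra fields trimmed.

```python
import json


def swap_selectors(sitemap_json, first_id, second_id):
    """Exchange the positions of two selectors in a sitemap JSON string."""
    sitemap = json.loads(sitemap_json)
    selectors = sitemap["selectors"]
    ids = [selector["id"] for selector in selectors]
    a, b = ids.index(first_id), ids.index(second_id)  # ValueError if an id is missing
    selectors[a], selectors[b] = selectors[b], selectors[a]
    return json.dumps(sitemap, separators=(",", ":"))


original = json.dumps({
    "_id": "mancan",
    "startUrl": ["http://www.companylisting.ca/CA/all/311/page1.aspx"],
    "selectors": [
        {"id": "list", "type": "SelectorLink", "selector": "a.rui_title",
         "parentSelectors": ["_root", "pagination"], "multiple": True},
        {"id": "details", "type": "SelectorText", "selector": "p.pl_company_info",
         "parentSelectors": ["list"], "multiple": True},
        {"id": "pagination", "type": "SelectorLink",
         "selector": "a#ContentPlaceHolder1_TopNextPage.page-link",
         "parentSelectors": ["_root", "pagination"], "multiple": True},
    ],
})

swapped = swap_selectors(original, "list", "pagination")
new_order = [selector["id"] for selector in json.loads(swapped)["selectors"]]
print(new_order)
```

The output string can then be pasted into the sitemap import dialog in place of the original JSON.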