Need help on pagination not working

8ternity · November 11, 2018, 3:24am

On this website, I'm trying to extract all the contacts with their postal information for my project, but I can't get the pagination working because the "a" buttons have blank (&nbsp;) text, so only the first page's data gets collected.

Url: https://www.centris.ca/en/real-estate-broker~chantale-yargeau~proprio-direct-val-d-or/G6101?view=Summary&pback=true

I tried the Link selector and also the Element Click selector, without success. It stops on the first page and doesn't load the next page.

I need help please.

Any help is appreciated. Also, if you send me the corrected sitemap, tell me how you resolved it. I want to know how to resolve this issue on another project.

Thanks a lot.

sitemap:
{"_id":"centris","startUrl":["https://www.centris.ca/fr"],"selectors":[{"id":"mon courtier","type":"SelectorLink","parentSelectors":["_root"],"selector":"li:nth-of-type(3) a.main-item","multiple":false,"delay":0},{"id":"trouver un courtier","type":"SelectorLink","parentSelectors":["mon courtier"],"selector":"h2.nav-title.first a","multiple":false,"delay":0},{"id":"sommaire","type":"SelectorLink","parentSelectors":["trouver un courtier"],"selector":"li.summary a.current","multiple":false,"delay":0},{"id":"nom de l'agent","type":"SelectorText","parentSelectors":["sommaire","pagination"],"selector":"h1.name span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"nom de l'agence","type":"SelectorText","parentSelectors":["sommaire","pagination"],"selector":"div.agencyid h2.smaller","multiple":false,"regex":"","delay":0},{"id":"adresse","type":"SelectorText","parentSelectors":["sommaire","pagination"],"selector":"span.address a span","multiple":false,"regex":"","delay":0},{"id":"ville, province, code","type":"SelectorText","parentSelectors":["sommaire","pagination"],"selector":"p.agencycontact span.address > span","multiple":false,"regex":"","delay":0},{"id":"telephone","type":"SelectorText","parentSelectors":["sommaire","pagination"],"selector":"p.agencycontact span.phone span","multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root","sommaire","pagination"],"selector":"div.right-section li.next a","multiple":false,"delay":0,"clickElementSelector":"div.right-section li.next a","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"}]}

bretfeig · November 11, 2018, 4:58am

See if you can work out what I did. I'm not entirely sure loading 1107 pages of data won't crash your browser but it will cycle through all 1107 pages before it scrapes anything.

{"_id":"centris","startUrl":["https://www.centris.ca/fr"],"selectors":[{"id":"mon courtier","type":"SelectorLink","parentSelectors":["_root"],"selector":"li:nth-of-type(3) a.main-item","multiple":false,"delay":0},{"id":"trouver un courtier","type":"SelectorLink","parentSelectors":["mon courtier"],"selector":"h2.nav-title.first a","multiple":false,"delay":0},{"id":"sommaire","type":"SelectorLink","parentSelectors":["pagination"],"selector":"a.btn","multiple":false,"delay":0},{"id":"nom de l'agent","type":"SelectorText","parentSelectors":["sommaire"],"selector":"h1.name span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"nom de l'agence","type":"SelectorText","parentSelectors":["sommaire"],"selector":"div.agencyid h2.smaller","multiple":false,"regex":"","delay":0},{"id":"adresse","type":"SelectorText","parentSelectors":["sommaire"],"selector":"span.address a span","multiple":false,"regex":"","delay":0},{"id":"ville, province, code","type":"SelectorText","parentSelectors":["sommaire"],"selector":"p.agencycontact span.address > span","multiple":false,"regex":"","delay":0},{"id":"telephone","type":"SelectorText","parentSelectors":["sommaire"],"selector":"p.agencycontact span.phone span","multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorElementClick","parentSelectors":["trouver un courtier"],"selector":"div.thumbnailItem","multiple":true,"delay":0,"clickElementSelector":"div.right-section li.next a","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"}]}

8ternity · November 11, 2018, 3:35pm

Hi bretfeig,

The address is missing. I need the physical address and telephone number. You need to click on the "Sommaire" tab; the first tab only has partial information.

I need
Agent Name
Agent Agency
Address
City, Province and Zip code
Phone number

All of this information is available in the Sommaire tab.

Can you get it working with that?

Thanks for your help.

iconoclast · November 11, 2018, 10:52pm

Hello there!

It seems you've set up far too many pagination selectors -- both Link and Element Click. If you only need to scrape the profile information, just start from page 1.

For it to navigate through all the pages properly, the Element Click selector has to cover both the information and the pagination.

Try this one out:

{"_id":"centris","startUrl":["https://www.centris.ca/en/real-estate-broker~majd-gerges~re-max-2001-inc./E8572?view=Summary"],"selectors":[{"id":"nom de l'agent","type":"SelectorText","parentSelectors":["pagination"],"selector":"h1.name span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"nom de l'agence","type":"SelectorText","parentSelectors":["pagination"],"selector":"div.agencyid h2.smaller","multiple":false,"regex":"","delay":0},{"id":"adresse","type":"SelectorText","parentSelectors":["pagination"],"selector":"span.address a span","multiple":false,"regex":"","delay":0},{"id":"ville, province, code","type":"SelectorText","parentSelectors":["pagination"],"selector":"p.agencycontact span.address > span","multiple":false,"regex":"","delay":0},{"id":"telephone","type":"SelectorText","parentSelectors":["pagination"],"selector":"p.agencycontact span.phone span","multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"section","multiple":true,"delay":"1500","clickElementSelector":"div.right-section li.next:not(.inactive) a","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"}]}
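
To see how the selectors in a sitemap like the one above nest under each other, here is a minimal sketch that prints the parent/child tree from the "parentSelectors" fields. The helper name selectorTree is hypothetical (it is not part of Web Scraper), and the sitemap object is a trimmed copy keeping only three of the selectors.

```javascript
// Sketch: print the parent -> child tree of a Web Scraper sitemap.
// `selectorTree` is a hypothetical helper, not part of Web Scraper.
function selectorTree(sitemap, parentId = '_root', depth = 0, seen = new Set()) {
    const lines = [];
    for (const sel of sitemap.selectors) {
        if (!sel.parentSelectors.includes(parentId)) continue;
        const label = '  '.repeat(depth) + sel.id + ' (' + sel.type + ')';
        if (seen.has(sel.id)) {
            // The selector is one of its own ancestors: mark it and stop recursing.
            lines.push(label + ' [recursive]');
            continue;
        }
        lines.push(label);
        const nextSeen = new Set(seen).add(sel.id);
        lines.push(...selectorTree(sitemap, sel.id, depth + 1, nextSeen));
    }
    return lines;
}

// Trimmed copy of the sitemap above (only two of the five text selectors kept).
const sitemap = {
    _id: 'centris',
    selectors: [
        { id: "nom de l'agent", type: 'SelectorText', parentSelectors: ['pagination'] },
        { id: 'telephone', type: 'SelectorText', parentSelectors: ['pagination'] },
        { id: 'pagination', type: 'SelectorElementClick', parentSelectors: ['_root'] }
    ]
};

console.log(selectorTree(sitemap).join('\n'));
```

Here the single Element Click selector sits directly under _root and every text selector is its child. Running the same helper on a sitemap where a selector lists itself in "parentSelectors" would show that selector again with a [recursive] mark.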

8ternity · November 11, 2018, 11:08pm

iconoclast:

Your code is working well.

But where was my mistake?
Here:
"selector":"span.address a span","multiple":false,"regex":"","delay":0

As I remember, that selector was picked on the next a.btn in the CSS reference.

bretfeig · November 11, 2018, 11:13pm

@iconoclast

I came up with this, but 13,000 elements is way too many and it eventually crashes.

{"_id":"centris","startUrl":["https://www.centris.ca/fr/courtiers-immobiliers?uc=6"],"selectors":[{"id":"Paginate","type":"SelectorLink","parentSelectors":["_root","Paginate"],"selector":".next","multiple":false,"delay":0},{"id":"Element","type":"SelectorElementClick","parentSelectors":["_root","Paginate"],"selector":"div.infos div.container","multiple":true,"delay":0,"clickElementSelector":"div.right-section li.next a","clickType":"clickMore","discardInitialElements":true,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"Bane","type":"SelectorText","parentSelectors":["Element"],"selector":"h1.name span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"Mobile Phone","type":"SelectorText","parentSelectors":["Element"],"selector":"p.contact a.telLinkerInserted","multiple":false,"regex":"","delay":0},{"id":"Title","type":"SelectorText","parentSelectors":["Element"],"selector":"h1.name span.second-line","multiple":false,"regex":"","delay":0},{"id":"Address1","type":"SelectorText","parentSelectors":["Element"],"selector":"a span","multiple":false,"regex":"","delay":0},{"id":"Address 2","type":"SelectorText","parentSelectors":["Element"],"selector":"span.address > span","multiple":false,"regex":"","delay":0},{"id":"Agency Phone ","type":"SelectorText","parentSelectors":["Element"],"selector":"p.agencycontact span.telLinkerInserted","multiple":false,"regex":"","delay":0}]}

8ternity · November 11, 2018, 11:18pm

Where did you get "section" for the single selector? From the HTML source or from the Web Scraper documentation?

Also, is there a way to make the scrape stop automatically after 14,000 tries? Otherwise, will it never stop after the cycle?

8ternity · November 11, 2018, 11:20pm

I've scraped over 15,000 lines with yours. But I have a lot of empty space in the CSV file: the data is split across 2 lines.

iconoclast · November 12, 2018, 12:03am

@8ternity

You can always find information on choosing the right selection for an Element selector in the documentation, or in the video tutorials on the main website.

Think of it as a wrapper that acts as a parent to all the elements contained inside it (I'm talking about the Selector field of an Element Click selector).

@bretfeig

Your browser crashed when you used CouchDB?

8ternity · November 12, 2018, 3:05am

Same thing, it crashes after about 1 hour.

I wasn't able to download the job.

bretfeig · November 12, 2018, 11:07am

Yes, it slowed to the point where it was no longer loading elements, and then it ended prematurely.

I've noticed that with the Element selector, it doesn't write to CouchDB until it has scrolled through all elements and completed the scrape (similar to how previewing the data shows nothing until it has scrolled through all elements).

iconoclast · November 12, 2018, 12:19pm

Then the only option is to use a Link selector set as its own parent; otherwise we'll face a buffer overflow.

Another option is to not collect every contact in one run, or to divide them into a few parts.
For example, pages 1-2000, then 2001-4000 and so on -- it's easy to accomplish (once I'm home I'll try to make a limited selector).
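
The batching idea can be sketched as a small range calculation. This is only an illustration: pageBatches is a hypothetical helper, and the total of 5000 pages is a made-up number, not the real page count of the site.

```javascript
// Sketch: split a page count into consecutive batches.
// `pageBatches` is a hypothetical helper; the totals below are illustrative.
function pageBatches(totalPages, batchSize) {
    const batches = [];
    for (let first = 1; first <= totalPages; first += batchSize) {
        // The last batch is cut off at the total page count.
        batches.push({ first: first, last: Math.min(first + batchSize - 1, totalPages) });
    }
    return batches;
}

const ranges = pageBatches(5000, 2000).map(function (b) {
    return b.first + '-' + b.last;
});
console.log(ranges.join(', ')); // 1-2000, 2001-4000, 4001-5000
```

Each range would then be one scraping job, so that no single job has to hold every record in memory at once.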

8ternity · November 12, 2018, 2:15pm

I found one problem with the data export. I wrote "City, province, postal code" with commas in the column title, so when all the data is extracted and opened in Excel, the commas also split the first line into separate columns.

Also, if you can help me stop every 2000 records, I will extract in batches. I don't know how to do it for now, so I'll wait for you.

Do you think the comma issue can crash the extraction, or is it just a buffer issue? I can install CouchDB too if it helps.

Thanks a lot for your help. It's really appreciated.
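
On the comma question: a column title that contains commas only stays in one column if the field is wrapped in double quotes, which is the quoting rule described in RFC 4180. Below is a minimal sketch of that rule, assuming the exported header was not quoted (this thread does not confirm it). csvField and csvRow are hypothetical helper names. Renaming the selector so its id has no commas avoids the problem entirely.

```javascript
// Sketch: quote one CSV field following the RFC 4180 rule.
// `csvField` and `csvRow` are hypothetical helpers, not part of Web Scraper.
function csvField(value) {
    const text = String(value);
    // Quote only when the field contains a comma, a double quote, or a line break.
    if (/[",\r\n]/.test(text)) {
        // Inner double quotes are escaped by doubling them.
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

function csvRow(fields) {
    return fields.map(csvField).join(',');
}

console.log(csvRow(["nom de l'agent", 'ville, province, code', 'telephone']));
// nom de l'agent,"ville, province, code",telephone
```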

iconoclast · November 12, 2018, 10:59pm

Okay, I've managed to get it working.

You will have to install the Tampermonkey extension beforehand, and import the script I've made for you below:

// ==UserScript==
// @name         centris
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  WebScraper Rocks!
// @author       You
// @match        https://www.centris.ca/en/real-estate-broker*
// @grant        unsafeWindow
// @require      http://code.jquery.com/jquery-2.0.3.min.js
// ==/UserScript==

(function() {
    'use strict';

    $(document).ready(function(){
        // Add the current page number as a class on the 'next' button so it can be targeted once the page opens
        $('#divWrapperPager > ul > li.next > a').addClass($('[class=pager-current]').html().replace(/[^0-9].+\s\d+.\d+$/, ''));
        // On each click, clear the button's classes so the old page number can be overwritten
        $('[class=next]').click(function(){
            $('[class=next] > a').removeClass();
            // Assign the page number shown in the pager
            $('[class=next] > a').addClass($('[class=pager-current]').html().replace(/[^0-9].+\s\d+.\d+$/, ''));
        });
    });
})();

It automatically copies the page number that is shown to you into the 'next' button's class, so we are able to catch specific values (e.g. a page number at which to stop the scrape).

Here's how it looks inside the elements tree (screenshot):

Now, if you want to scrape records up to page 2000 (for example), your selector should look like this:

div.right-section li.next:not(.inactive) a:not(.2001)

li.next:not(.inactive) -- makes WebScraper stop when the Next button becomes greyed out (i.e. disabled, although it still works on this particular website)
a:not(.2001) -- stops WebScraper from 'seeing' the 'link' to the next page once it has 2001 in its class.
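
As a rough illustration of the two pieces above, the sketch below applies the userscript's regex to a sample pager text and builds the stop selector for a given last page. The pager text format "12 / 1,107" is an assumption about what the page shows, and currentPage and stopSelector are hypothetical helper names.

```javascript
// Sketch: the regex is the one from the userscript above; the pager text
// format ("12 / 1,107") is an assumption, not taken from the site.
function currentPage(pagerText) {
    // Strips everything after the leading page number.
    return pagerText.replace(/[^0-9].+\s\d+.\d+$/, '');
}

// Build the click selector that stops once `lastPage` has been scraped.
// `stopSelector` is a hypothetical helper name.
function stopSelector(lastPage) {
    return 'div.right-section li.next:not(.inactive) a:not(.' + (lastPage + 1) + ')';
}

console.log(currentPage('12 / 1,107')); // 12
console.log(stopSelector(2000));
// div.right-section li.next:not(.inactive) a:not(.2001)
```

One caveat: in standard CSS a class selector cannot start with a digit unless it is escaped, so document.querySelector would reject ".2001". It is reported as working here inside WebScraper; if it ever fails elsewhere, the attribute form a:not([class~="2001"]) is a standards-valid way to express the same condition.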

bretfeig · November 14, 2018, 2:39am

I've managed to make it work by using an Element selector for the Link selector into the records. This way it cycles through all the records before starting to scrape. It's taken 24 hours, but I've got 300 out of 13,000 records to go before I'm done. I will update you guys and post the data when it's done.

bretfeig · November 14, 2018, 9:29am

I managed to get all but 8 records. Not quite sure why, but it took 24 hours to run.

Here you go

8ternity · November 14, 2018, 11:36pm

I was able to download all of them with CouchDB and my Asus ROG Strix. I tried on 3 computers and this one was the only one able to do all of them.

Thanks for your help. I will have another project this week or next. It's important for me to understand my mistake for the next projects.

8ternity · November 21, 2018, 2:50am

Hi @bretfeig and @iconoclast !

Just to give you some news: I've sent over 500 letters using the data I collected. Pretty useful. I have another project next month to find another kind of data. I know of a software product called Info Canada that has this kind of data with all the contacts, but it charges for monthly usage and also for printing or exporting records. I found this website and I'm really proud to have found an easy tool that does the job and helps with our company's marketing process.

Thanks for your help and precious time. It's appreciated.

Fingers crossed that we get positive feedback on our marketing campaign.

Thanks.