Double scrollbar: scroll down does not work and then the page crashes (WS). How to?

MqzakpoOk · April 5, 2020, 2:11pm

This website is weird.
It's built with AJAX or WS, I don't know which,
but it has been very challenging to get anything out of it.

First, there are two scrollbars on the right-hand side,
so the scroll down does not work.

So I then tried to do it with a mouse macro that records my scrolling before starting the sitemap.
The problem is that after a lot of loading, when I arrive at the entries under the letter B, it crashes (the RAM, I guess).

I looked at a similar topic saying it should be done in blocks of 10 to 25. The problem is that the scroll-down selector does not work for me. Thank you in advance for your help!

Url: https://www.irmawork.com/annuaire/recherche?q=a

Sitemap:
{"_id":"irmawork","startUrl":["https://www.irmawork.com/annuaire/recherche?q=a"],"selectors":[{"id":"contact","type":"SelectorLink","parentSelectors":["scroldown"],"selector":".jss263 > a","multiple":true,"delay":"15000000"},{"id":"fiche2","type":"SelectorLink","parentSelectors":["contact"],"selector":"a.jss136","multiple":true,"delay":0},{"id":"info2","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(2) p:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"Info3","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(3) p","multiple":false,"regex":"","delay":0},{"id":"Info4","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"Info5","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"Nom contact 1","type":"SelectorText","parentSelectors":["fiche2"],"selector":"li:nth-of-type(1) p","multiple":false,"regex":"","delay":0},{"id":"Nom contact 2","type":"SelectorText","parentSelectors":["fiche2"],"selector":"li:nth-of-type(2) p","multiple":false,"regex":"","delay":0},{"id":"Job title 1","type":"SelectorText","parentSelectors":["fiche2"],"selector":"li:nth-of-type(1) span","multiple":false,"regex":"","delay":0},{"id":"Job title 2","type":"SelectorText","parentSelectors":["fiche2"],"selector":"li:nth-of-type(2) span","multiple":false,"regex":"","delay":0},{"id":"Autre activité","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(7) ul","multiple":false,"regex":"","delay":0},{"id":"type","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(1) h5","multiple":false,"regex":"","delay":0},{"id":"tel","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(3) p:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"Titre","type":"SelectorText","parentSelectors":["fiche2"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"ville","type":"SelectorText","parentSelectors":["fiche2"],"selector":"p:nth-of-type(2)","multiple":true,"regex":"","delay":0},{"id":"plce","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(2) p","multiple":true,"regex":"","delay":0},{"id":"all","type":"SelectorText","parentSelectors":["fiche2"],"selector":"div:nth-of-type(1) h5, p, div:nth-of-type(2) h6","multiple":true,"regex":"","delay":0},{"id":"scroldown","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div.row.ng-scope:nth-of-type(-n+10)","multiple":true,"delay":"3000"}]}

leemeng · April 6, 2020, 8:24am

Ya I think you're running into RAM limits.

It's better to scrape this site in two stages: in stage 1, you get all the artist URLs from the main page; in stage 2, you use a different sitemap that takes all those URLs as start URLs. You should scrape in smaller batches, 3000-4000 URLs max at a time.
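
The batching step can be sketched in Python. This is only an illustration under assumptions: `chunk_urls` and `build_stage2_sitemaps` are hypothetical helper names, the example URLs and the `_id` pattern are placeholders, and the dict layout simply mirrors the `_id` / `startUrl` / `selectors` keys of the sitemap JSON posted in this thread.

```python
import json
from typing import Dict, Iterator, List


def chunk_urls(urls: List[str], batch_size: int = 3000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


def build_stage2_sitemaps(
    urls: List[str],
    selectors: List[Dict],
    base_id: str = "batch",
    batch_size: int = 3000,
) -> List[Dict]:
    """Return one stage-2 sitemap dict per batch of start URLs."""
    return [
        {"_id": f"{base_id}_{number}", "startUrl": batch, "selectors": selectors}
        for number, batch in enumerate(chunk_urls(urls, batch_size), start=1)
    ]


if __name__ == "__main__":
    # Placeholder stage-1 output; replace with the URLs you collected.
    stage1_urls = [f"https://example.com/profile/{n}" for n in range(7)]
    # One selector copied in shape from the sitemaps in this thread.
    title_selector = {
        "id": "Titre",
        "type": "SelectorText",
        "parentSelectors": ["_root"],
        "selector": "h1",
        "multiple": False,
        "regex": "",
        "delay": 0,
    }
    for sitemap in build_stage2_sitemaps(stage1_urls, [title_selector], batch_size=3):
        # Each line is one JSON document to import as its own sitemap.
        print(json.dumps(sitemap))
```

Running one batch at a time keeps each job small, which is the point of the 3000-4000 URL limit suggested above.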

MqzakpoOk · April 7, 2020, 7:50am

Hi there Leemeng,

Thank you very much,
I'm going to try something based on that recommendation, starting from the robots.txt.

From there I can find the sitemap, and then the sitemap for persons.

And from there I think I know what to do.
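
A minimal sketch of that robots.txt route, in Python with only the standard library. It works on text that has already been downloaded, so nothing here touches the network. The function names (`sitemap_urls_from_robots`, `locs_from_sitemap`, `keep_matching`) are hypothetical, and the sample content is made up; only the `/annuaire/activite/` path fragment comes from the URLs in this thread.

```python
import xml.etree.ElementTree as ET
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def locs_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Drop the "{namespace}" prefix so namespaced and plain files both work.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


def keep_matching(urls: List[str], fragment: str) -> List[str]:
    """Keep only the URLs that contain the given path fragment."""
    return [url for url in urls if fragment in url]


if __name__ == "__main__":
    # Made-up sample data standing in for the downloaded files.
    robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/annuaire/activite/artistes/1-demo</loc>"
        "<lastmod>2020-02-11</lastmod></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    )
    print(sitemap_urls_from_robots(robots))
    print(keep_matching(locs_from_sitemap(sitemap), "/annuaire/activite/"))
```

Reading only the `<loc>` elements keeps the `<lastmod>` dates out of the list, so each entry is a clean URL that can be pasted as a start URL.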

Thank you so much for your help

MqzakpoOk · April 8, 2020, 6:17pm

Hi there @leemeng,

A little update on this subject:
your solution definitely works.

But I'm again getting a frustrating result.

To get the contact details on this website, you need to be logged in,
so I log in before launching the scrape and have my browser remember my password.

The problem is that after about 30 scraped URLs, the session times out,
and I'm then considered as not logged in anymore.

I wonder if it is because I'm using a proxy? I first set it to rotating IP, then changed to long session, but eventually it did the same.

Here is the sitemap if you want to try and see for yourself:

{"_id":"irmawork4","startUrl":["https://www.irmawork.com/annuaire/activite/bureaux-d-etudes-et-consultants/124699-3sa-conseil 2020-02-11","https://www.irmawork.com/annuaire/activite/artistes/113547-3somesisters 2017-03-21","https://www.irmawork.com/annuaire/activite/labels-et-maisons-de-disques/28196-3wm8mw3-et-trheesome.com 2017-01-25","https://www.irmawork.com/annuaire/activite/federations-et-reseaux-regionaux/120281-4-ass-et-plus 2019-07-17","https://www.irmawork.com/annuaire/activite/salles-de-moins-de-400-places/78527-4-bis-salle-de-concerts 2019-09-06","https://www.irmawork.com/annuaire/activite/salles-de-moins-de-400-places/110-4-ecluses 2019-06-28","https://www.irmawork.com/annuaire/activite/salles-de-moins-de-400-places/101729-4-elements 2017-01-31","https://www.irmawork.com/annuaire/activite/scenographes-createurs-lumieres-et-vj-s/121773-4-eleven 2018-07-05","https://www.irmawork.com/annuaire/activite/entreprises-de-sonorisation-et-d-eclairage/121774-4-eleven 2018-07-05","https://www.irmawork.com/annuaire/activite/artistes/75834-40-batteurs 2019-04-30","https://www.irmawork.com/annuaire/activite/formations-artistiques/75835-40-batteurs 2019-07-22","https://www.irmawork.com/annuaire/activite/agents-entrepreneurs-de-spectacles/119474-438-productions 2017-12-07","https://www.irmawork.com/annuaire/activite/bureaux-d-etudes-et-consultants/113116-44 2015-09-29","https://www.irmawork.com/annuaire/activite/fabrication-distribution-de-materiels-studio/110753-44.1 2018-07-31","https://www.irmawork.com/annuaire/activite/labels-et-maisons-de-disques/118-442eme-rue 2018-10-01"],"selectors":[{"id":"CP + Ville","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(2) p:nth-of-type(2)","multiple":true,"regex":"","delay":0},{"id":"Type / genre","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(3) p","multiple":false,"regex":"","delay":0},{"id":"Email","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"Website","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"Nom contact 1","type":"SelectorText","parentSelectors":["_root"],"selector":"li:nth-of-type(1) p","multiple":false,"regex":"","delay":0},{"id":"Nom contact 2","type":"SelectorText","parentSelectors":["_root"],"selector":"li:nth-of-type(2) p","multiple":false,"regex":"","delay":0},{"id":"Job title 1","type":"SelectorText","parentSelectors":["_root"],"selector":"li:nth-of-type(1) span","multiple":false,"regex":"","delay":0},{"id":"Job title 2","type":"SelectorText","parentSelectors":["_root"],"selector":"li:nth-of-type(2) span","multiple":false,"regex":"","delay":0},{"id":"Autre activité","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(7) ul","multiple":false,"regex":"","delay":0},{"id":"type","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(1) h5","multiple":false,"regex":"","delay":0},{"id":"Titre","type":"SelectorText","parentSelectors":["_root"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"Rue","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(2) p:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"All2","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(3)","multiple":true,"regex":"","delay":0},{"id":"adresse2","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(2) p:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"Adresse 3","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(2) p:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"Tel3","type":"SelectorText","parentSelectors":["_root"],"selector":"div.jss704","multiple":false,"regex":"","delay":0},{"id":"All4","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"tel5","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(n+3), div:nth-of-type(2) h6, div:nth-of-type(1) h5","multiple":true,"regex":"","delay":0}]}

MqzakpoOk · May 4, 2020, 3:45pm

Hi there,

I more or less found a way around my problem by using the "EditThisCookie" Chrome extension.
My problem was that after 30 minutes of being logged in to the website, a cookie was automatically kicking me out, leaving me unable to get the data I wanted.
So every 30 minutes I would get disconnected, and unless I re-entered my credentials, the sitemap would go on and on but without the protected information (email and telephone).
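
One way to see which pages were scraped after the logout is to scan the exported CSV for rows where the protected columns came back empty. The Python sketch below is a guess at how that could look, not a tested recipe for this site. It assumes the export has a `web-scraper-start-url` column, that the protected fields are the `Email` and `Tel3` selector ids from the sitemap above, and that a missing value shows up as an empty cell or the text `null`. `urls_missing_fields` is a hypothetical helper name.

```python
import csv
import io
from typing import List, Sequence


def urls_missing_fields(
    csv_text: str,
    protected: Sequence[str] = ("Email", "Tel3"),
    url_column: str = "web-scraper-start-url",
) -> List[str]:
    """Return the start URLs of rows where every protected column is empty."""

    def is_empty(value) -> bool:
        # Assumption: empty cells and the literal text "null" both mean "no data".
        text = (value or "").strip()
        return text == "" or text.lower() == "null"

    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(is_empty(row.get(name)) for name in protected):
            missing.append(row[url_column])
    return missing


if __name__ == "__main__":
    # Made-up export with one complete row and one logged-out row.
    sample = (
        "web-scraper-order,web-scraper-start-url,Titre,Email,Tel3\n"
        "1-1,https://example.com/a,A,a@example.com,0102030405\n"
        "1-2,https://example.com/b,B,,\n"
    )
    print(urls_missing_fields(sample))
```

The returned URLs can then be used as the start URLs of a new, smaller run after logging in again.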

So I used EditThisCookie to change the expiration value of my credentials cookie. I couldn't really test it, as another problem came up a few days ago.

It seems they have noticed that their database was being targeted, and now it is impossible to go to any of the targeted URLs.

For example, I can manually start from this URL: https://www.irmawork.com/annuaire/recherche?q=a

and click on the first result that appears. Here is the exact URL of that page: https://www.irmawork.com/annuaire/activite/bureaux-d-etudes-et-consultants/101283-allez-zou-l-agence

But now, if I target this URL within my sitemap, it automatically takes me to: https://www.irmawork.com/annuaire

How is this possible, @iconoclast? Is there a way to avoid this redirection? I have seen that with other scraping tools you can change some metadata. Would that allow access to the targeted URL?

Thank you for your help