Unknown subdirectory

I'm trying to pull data where the subdirectory, e.g. 'cut-collective', is unknown, but where I can pull all pages after /profile/, if that's possible?

Can I just add something after /profile/ in the sitemap metadata? Or is it done through selectors?

Thanks

Sitemap:
{"_id":"bigid","startUrl":["https://www.thebigidea.nz/profile/cut-collective"],"selectors":[{"id":"Name","type":"SelectorText","selector":"h1.user-title","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Type","type":"SelectorText","selector":"div.field.field-name-field-is-organisation","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests","type":"SelectorText","selector":"div.group-content-left h2","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests Content","type":"SelectorText","selector":"div.group-content-left p:nth-of-type(1)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"more","type":"SelectorText","selector":"div.field p:nth-of-type(2)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Full Bio","type":"SelectorText","selector":"div.field.field-name-field-bio","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Links","type":"SelectorText","selector":"div.field.field-name-field-links a","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0}]}

Hi

Do you want to scrape data from all pages like these:

etc

I'm not sure I understand what you are trying to scrape. Could you be more precise?
Thanks

I'm just after the profile info (name, interests, personal description, links), if that makes sense?

but I'm trying to jump from

profile/cut-collective

to all the other profile subdirectories with unknown names, i.e.
profile/makeplace
profile/artspace

Thanks so much.

Sorry, but I still don't understand where you want to navigate more deeply within this site.

Looking at your sitemap, you first collect data on the main page:
https://www.thebigidea.nz/profile/cut-collective

But after that, which page(s) do you want to select:
http://cutcollective.co.nz/ ?
https://www.thebigidea.nz/connect/media-releases/2014/mar/139137-art-container-embarks-on-round-the-world-voyage ?

Perhaps you will have to give us your login and password so we can follow along.

Yours

Can the scraper go
https://www.thebigidea.nz/profile/cut-collective ->
https://www.thebigidea.nz/profile/makeplace ->
https://www.thebigidea.nz/profile/artspace
and then on to the next subdirectory, without knowing what it is?

Or does it need links to exist within the page to be able to jump? For PHP/numbered pages I can do id=[1-1000]; I'm wondering if there's a similar thing to cover every possible word subdirectory?

Would an alternative route be to pull a list of all the subdirectories from a site: search, then scrape each page from that list?

Hi

now it is clearer with the three URLs you gave me.

Of course you can build a single sitemap including these three URLs, but you have to add each of them by clicking on the "+" to add a 4th, 5th, etc. You have to type each address exactly.

Unfortunately, Web Scraper can't understand a wildcard syntax like https://www.thebigidea.nz/profile/*, i.e. "scrape every page whose URL starts with the string https://www.thebigidea.nz/profile/".

I don't know this site very well, but I presume you can't find a page inside it with links to all the URLs starting with "https://www.thebigidea.nz/profile/".

For the three URLs, here is the sitemap you can create:

{"_id":"test_thebigidea","startUrl":["https://www.thebigidea.nz/profile/cut-collective","https://www.thebigidea.nz/profile/artspace","https://www.thebigidea.nz/profile/makeplace"],"selectors":[{"id":"Name","type":"SelectorText","selector":"h1.user-title","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Type","type":"SelectorText","selector":"div.field.field-name-field-is-organisation","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests","type":"SelectorText","selector":"div.group-content-left h2","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests Content","type":"SelectorText","selector":"div.group-content-left p:nth-of-type(1)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"more","type":"SelectorText","selector":"div.field p:nth-of-type(2)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Full Bio","type":"SelectorText","selector":"div.field.field-name-field-bio","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Links","type":"SelectorText","selector":"div.field.field-name-field-links a","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0}]}
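If you later recover a longer list of profile slugs (for example from a search-engine query or the Wayback Machine), you don't have to type every address into the "+" field by hand: you can generate the startUrl array programmatically and paste the result into Web Scraper's "Import Sitemap" dialog. A minimal sketch in Python; the slug list here is just an illustration, and only the first selector is shown:

```python
import json

# Hypothetical list of profile slugs; replace with whatever list you recover.
slugs = ["cut-collective", "artspace", "makeplace"]

# Same structure as the sitemap above, with startUrl built from the slugs.
sitemap = {
    "_id": "test_thebigidea",
    "startUrl": [f"https://www.thebigidea.nz/profile/{s}" for s in slugs],
    "selectors": [
        {"id": "Name", "type": "SelectorText", "selector": "h1.user-title",
         "parentSelectors": ["_root"], "multiple": False, "regex": "", "delay": 0},
        # ... copy the remaining selectors from the sitemap above in the same way
    ],
}

# Paste this output into Web Scraper's "Import Sitemap" dialog.
print(json.dumps(sitemap))
```

This only automates building the startUrl list; the selectors stay exactly as in your existing sitemap.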

Thanks for that.

Yeah, the site deleted the functionality for browsing through members, and it doesn't seem like I can fully pull them from the Wayback Machine.

Is the best idea to run queries at a slower rate through Google's site: search, or are there more scraper-friendly engines?

Cheers

Hi
This Chrome extension is one of the most powerful and easiest to use.
Lots of other extensions or programs are not free.