Unknown subdirectory

I'm trying to pull data where the subdirectory, e.g. 'cut-collective', is unknown, but where I can pull all pages after /profile/, if that's possible?

Can I just add something after /profile/ in the sitemap metadata? Or is it done through selectors?

Thanks

Sitemap:
{"_id":"bigid","startUrl":["https://www.thebigidea.nz/profile/cut-collective"],"selectors":[{"id":"Name","type":"SelectorText","selector":"h1.user-title","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Type","type":"SelectorText","selector":"div.field.field-name-field-is-organisation","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests","type":"SelectorText","selector":"div.group-content-left h2","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests Content","type":"SelectorText","selector":"div.group-content-left p:nth-of-type(1)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"more","type":"SelectorText","selector":"div.field p:nth-of-type(2)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Full Bio","type":"SelectorText","selector":"div.field.field-name-field-bio","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Links","type":"SelectorText","selector":"div.field.field-name-field-links a","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0}]}

Hi

Do you want to scrape data from all pages like these:

etc

I'm not sure I understand what you are trying to scrape. Could you be more precise?
Thanks

I'm just after the profile info (name, interests, personal description, links), if that makes sense?

but I'm trying to jump from

profile/cut-collective

to all the other profile subdirectories with unknown names, i.e.
profile/makeplace
profile/artspace

Thanks so much.

Sorry, but I still don't understand where you want to navigate more deeply within this site.

Looking at your sitemap, you first collect data on the main page:
https://www.thebigidea.nz/profile/cut-collective

But after that, which page(s) do you want to select:
http://cutcollective.co.nz/ ?
https://www.thebigidea.nz/connect/media-releases/2014/mar/139137-art-container-embarks-on-round-the-world-voyage ?

Perhaps you will have to give us your login and password so we can follow along.

Yours

Can the scraper go
https://www.thebigidea.nz/profile/cut-collective ->
https://www.thebigidea.nz/profile/makeplace ->
https://www.thebigidea.nz/profile/artspace
and then on to the next subdirectory, without knowing what it is?

Or does it need links to exist within the page to be able to jump? For PHP/numbered pages I can do id=[1-1000]; I'm wondering if there's a similar thing to cover every possible word subdirectory?

Would an alternative route be to pull a list of all the subdirectories from a site: search, then scrape each page from that list?

Hi

now it is clearer with the three URLs you gave me.

Of course you can build a single sitemap including these three URLs, but you have to add each of them by clicking on the "+" to add a 4th, 5th, etc. You have to type each address exactly.

Unfortunately, Web Scraper can't understand a wildcard syntax like https://www.thebigidea.nz/profile/*, i.e. "scrape every page whose URL starts with the string https://www.thebigidea.nz/profile/".

I don't know this site very well, but I presume you can't find a page inside it with links to all the URLs starting with "https://www.thebigidea.nz/profile/".

For the three URLs, here is the sitemap you can create:

{"_id":"test_thebigidea","startUrl":["https://www.thebigidea.nz/profile/cut-collective","https://www.thebigidea.nz/profile/artspace","https://www.thebigidea.nz/profile/makeplace"],"selectors":[{"id":"Name","type":"SelectorText","selector":"h1.user-title","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Type","type":"SelectorText","selector":"div.field.field-name-field-is-organisation","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests","type":"SelectorText","selector":"div.group-content-left h2","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Interests Content","type":"SelectorText","selector":"div.group-content-left p:nth-of-type(1)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"more","type":"SelectorText","selector":"div.field p:nth-of-type(2)","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Full Bio","type":"SelectorText","selector":"div.field.field-name-field-bio","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Links","type":"SelectorText","selector":"div.field.field-name-field-links a","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0}]}
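If you later recover a longer list of profile slugs (for example from a search-engine query or the Wayback Machine), you don't have to type every address into the "+" field by hand: you can generate the startUrl array programmatically and paste the result into Web Scraper's "Import Sitemap" dialog. A minimal sketch in Python; the slug list here is just an illustration, and only the first selector is shown:

```python
import json

# Hypothetical list of profile slugs; replace with whatever list you recover.
slugs = ["cut-collective", "artspace", "makeplace"]

# Same structure as the sitemap above, with startUrl built from the slugs.
sitemap = {
    "_id": "test_thebigidea",
    "startUrl": [f"https://www.thebigidea.nz/profile/{s}" for s in slugs],
    "selectors": [
        {"id": "Name", "type": "SelectorText", "selector": "h1.user-title",
         "parentSelectors": ["_root"], "multiple": False, "regex": "", "delay": 0},
        # ... copy the remaining selectors from the sitemap above in the same way
    ],
}

# Paste this output into Web Scraper's "Import Sitemap" dialog.
print(json.dumps(sitemap))
```

This only automates building the startUrl list; the selectors stay exactly as in your existing sitemap.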

Thanks for that.

Yeah, the site deleted the functionality for browsing through members, and it doesn't seem like I can fully pull them from the Wayback Machine.

Is the best idea to run queries at a slower rate through Google's site: search, or are there more scraper-friendly engines?

Cheers

Hi
This Chrome extension is one of the most powerful and easiest to use.
Lots of other extensions or programs are not free.