Hi,
I'm on a Linux machine, so this may not be of much help to you. This is the scraper I start with - just fill in the start URL.
{"_id":"dirtree_scraper","startUrl":[""],"selectors":[{"id":"dir","type":"SelectorLink","parentSelectors":["_root","dir"],"selector":"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link","multiple":true,"delay":""},{"id":"dirname","type":"SelectorText","parentSelectors":["_root","dir"],"selector":"div.od-FieldRenderer-Renderea-withMetadata > span a.ms-Link","multiple":true,"regex":"","delay":0}]}
I then save the scraped data as a CSV file, which is read by the following script. The script appends every URL it finds to "all_sites.txt" (delete that file before scraping a new site) and prints a new scraper definition to the terminal, which I then import as a new scraper.
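For reference, Web Scraper's CSV export normally puts the columns in the order web-scraper-order, web-scraper-start-url, dir, dir-href, dirname (treat that order as an assumption), so a data row looks roughly like this (values invented):

"1680000000-2","https://example-my.sharepoint.com/.../onedrive.aspx?id=%2FDocuments","Reports","https://example-my.sharepoint.com/.../onedrive.aspx?id=%2FDocuments%2FReports","Reports"

That is why the awk below falls back to field 2 (the start URL) whenever field 4 (the link URL) is empty, and always appends field 5 (the folder name).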
sed 's/"//g; s/ /%20/g; s/\r//g' dirtree_scraper.csv | tail -n+2 | awk -F, '
{
printf "\""
printf "\"" >> "all_sites.txt"
if ($4 == "") {
printf "%s", $2
printf "%s", $2 >> "all_sites.txt"
}
else {
printf "%s", $4
printf "%s", $4 >> "all_sites.txt"
}
print "%2F" $5 "\""
print "%2F" $5 "\"" >> "all_sites.txt"
} ' | sort -u | paste -sd,| awk '
{ print $0 }
BEGIN { print "{\"_id\":\"dirtree_generated" PROCINFO["pid"] "\",\"startUrl\":[" }
END { print "],\"selectors\":[{\"id\":\"dir\",\"type\":\"SelectorLink\",\"parentSelectors\":[\"_root\",\"dir\"],\"selector\":\"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link\",\"multiple\":true,\"delay\":\"\"},{\"id\":\"dirname\",\"type\":\"SelectorText\",\"parentSelectors\":[\"_root\",\"dir\"],\"selector\":\"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link\",\"multiple\":true,\"regex\":\"\",\"delay\":0}]}" }
'
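In other words, each line in all_sites.txt ends up as a quoted folder URL with the folder name appended as a %2F path segment, roughly like this (invented example):

"https://example-my.sharepoint.com/.../onedrive.aspx?id=%2FDocuments%2FReports"

The terminal output is the same list, joined with commas and wrapped into the startUrl array of the generated sitemap, so it can be pasted straight into the import dialog.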
When I'm fairly certain all folders have been collected, I run the following script to generate a scraper for the full listing.
# Collapse links that picked up a second id= parameter, %-encode the first underscore on each line,
# deduplicate, and wrap everything in a sitemap that only grabs the file/folder names.
sed 's/id=.*id=/id=/g ; s/_/%5F/' all_sites.txt | sort -u | paste -sd, | awk '
    { print $0 }
    BEGIN { print "{\"_id\":\"dirtree_filesonly" PROCINFO["pid"] "\",\"startUrl\":[" }
    END { print "],\"selectors\":[{\"id\":\"files\",\"type\":\"SelectorText\",\"parentSelectors\":[\"_root\"],\"selector\":\"a.ms-Link\",\"multiple\":true,\"regex\":\"\",\"delay\":0}]}"}
'
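The first sed expression handles lines that have ended up with two id= parameters, keeping only what follows the last (most specific) one. Roughly, with an invented fragment, it turns

...onedrive.aspx?id=%2FDocuments&id=%2FDocuments%2FReports

into

...onedrive.aspx?id=%2FDocuments%2FReports

(note that s/_/%5F/ has no g flag, so only the first underscore on a line gets encoded).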
The result of this final scraper is the full file and directory listing. The process is a bit of a bother: there is no real way to tell when every directory has been collected, and the same folder can get listed multiple times (I never got around to proper duplicate detection). In the end I figured out how to use an application access token and wrote a script that accesses the site directly. Much more stable and easier to use.
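For anyone wanting to skip the scraper entirely: assuming the target is a OneDrive/SharePoint library (which the od-/ms- selector classes suggest) and you have an app token for Microsoft Graph, a folder listing is a single call along these lines - the drive ID and token are placeholders:

curl -s -H "Authorization: Bearer $TOKEN" \
  "https://graph.microsoft.com/v1.0/drives/DRIVE_ID/root/children?\$select=name,size,folder"

Large folders are paged, so keep following the @odata.nextLink value in each response until it disappears.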