Scraping directory tree

I'm trying to scrape a SharePoint file storage - i.e. a directory tree structure with files. Is this possible to do with Web Scraper? As far as I can tell, it just goes down one directory and then never goes back up to take the next one. I've tried both the Link Selector and the Element Click Selector; neither seems to do what I want.

As it's an internal site, I sadly cannot give you an example link. The question should be general enough, though. Below is my latest attempt.

{"_id":"sysver","startUrl":["https://secretinternalsite.sharepoint.com/sites/roboticsprojectstorage/Shared%20Documents/Forms/AllItems.aspx?id=%2Fsites%2Fblablabla%2FShared%20Documents%2Ftest_results%2FSystemVerificationOctober2016%20-%2028.130%2FBumpy_wall"],"selectors":[{"id":"dir","type":"SelectorLink","parentSelectors":["_root","dir"],"selector":"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link","multiple":true,"delay":""},{"id":"files","type":"SelectorText","parentSelectors":["_root","dir"],"selector":"a.ms-Link","multiple":true,"regex":"","delay":0}]}

Some more testing - it might just be that the Link Selector doesn't work because of the JavaScript used to load the content. From what I could see, though, the Element Click Selector just went down to a leaf of the tree and never went back. It didn't collect anything on the way down, either.

Actually, the Link Selector should work, as SharePoint also has clickable links. However, it seems to freeze sometimes.

Regardless, I ended up making a solution where I have a scraper that collects URLs to all subfolders of a folder, and then I run a script to generate the same scraper again, but with those subfolders as start URLs. Normally this only collects one level of folders, so I have to repeat the process multiple times. A bit bothersome, but it works.

I create a list of all these start URLs, then feed them into a second type of scraper, which only collects file names and never descends, producing a complete file listing.
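
Roughly, the process looks like this (just an outline - the script and file names here are made up):

# 1. run the folder scraper in Web Scraper and export the result as a CSV
# 2. generate a new sitemap whose start URLs are all the folders found so far
bash generate_folder_sitemap.sh dirtree_scraper.csv > dirtree_generated.json
# 3. import dirtree_generated.json into Web Scraper, run it, export the CSV again
# 4. repeat steps 2-3 until a pass stops turning up new folders
# 5. generate and run a files-only scraper over the full list of folder URLs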

Annoyingly, now when I run the first scraper, it actually goes down into most of the folders. Not sure why - it seems a bit random whether it just finishes or continues on down. I have noticed that forcefully reloading the page can help - does Web Scraper not understand that the page has loaded?

raahlb, I'm interested in doing something similar. Can you explain how you "create a list of all these start URLs, then feed them into a second type of scraper, which only collects file names and never descends, producing a complete file listing"? I think this would help me do what I need to do, but I'm not sure I understand it completely.

Hi,
I run a Linux machine, so this may not be of help to you. This is the scraper I start with - just fill in the start URL.

{"_id":"dirtree_scraper","startUrl":[""],"selectors":[{"id":"dir","type":"SelectorLink","parentSelectors":["_root","dir"],"selector":"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link","multiple":true,"delay":""},{"id":"dirname","type":"SelectorText","parentSelectors":["_root","dir"],"selector":"div.od-FieldRenderer-Renderea-withMetadata > span a.ms-Link","multiple":true,"regex":"","delay":0}]}

I then export the data as a CSV file, which is read by the script below. The script appends every URL it finds to "all_sites.txt" (delete that file before scraping a new site) and prints a new sitemap to the terminal, which I import as a new scraper.

sed 's/"//g; s/ /%20/g; s/\r//g' dirtree_scraper.csv | tail -n+2 | awk -F, '
{
 printf "\""
 printf "\"" >> "all_sites.txt"
 if ($4 == "") {
    printf "%s", $2
    printf "%s", $2 >> "all_sites.txt"
  }
  else {
    printf "%s", $4
    printf "%s", $4 >> "all_sites.txt"
  }
  print  "%2F" $5 "\""
  print  "%2F" $5 "\"" >> "all_sites.txt"
 } ' | sort -u | paste -sd,| awk '
{ print $0 }
BEGIN { print "{\"_id\":\"dirtree_generated" PROCINFO["pid"] "\",\"startUrl\":[" }
END { print "],\"selectors\":[{\"id\":\"dir\",\"type\":\"SelectorLink\",\"parentSelectors\":[\"_root\",\"dir\"],\"selector\":\"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link\",\"multiple\":true,\"delay\":\"\"},{\"id\":\"dirname\",\"type\":\"SelectorText\",\"parentSelectors\":[\"_root\",\"dir\"],\"selector\":\"div.od-FieldRenderer-Renderer-withMetadata > span a.ms-Link\",\"multiple\":true,\"regex\":\"\",\"delay\":0}]}" }
'
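
If you save the pipeline above as a shell script (the name below is made up), using it looks roughly like this:

# generate the next sitemap and save it
bash gen_dirtree.sh > dirtree_generated.json
# paste the contents of dirtree_generated.json into Web Scraper's Import Sitemap dialog
# a rough way to see whether the last pass actually found anything new:
sort -u all_sites.txt | wc -l    # compare this count between passes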

When I'm fairly certain all folders have been collected, I run the following script to generate a scraper for the full listing.

# URLs in all_sites.txt can end up with more than one id= parameter; keep only the
# last one, percent-encode underscores, then wrap the deduplicated list in a sitemap
# that has a single Text selector and no Link selector, so it never descends.
sed 's/id=.*id=/id=/g ; s/_/%5F/g' all_sites.txt | sort -u | paste -sd, | awk '
{ print $0 }
BEGIN { print "{\"_id\":\"dirtree_filesonly" PROCINFO["pid"] "\",\"startUrl\":[" }
END { print "],\"selectors\":[{\"id\":\"files\",\"type\":\"SelectorText\",\"parentSelectors\":[\"_root\"],\"selector\":\"a.ms-Link\",\"multiple\":true,\"regex\":\"\",\"delay\":0}]}" }
'
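
Same procedure: save the output, import it as a sitemap, run it, and export the data. Since the same folder can end up being scraped more than once, the export can be deduplicated with something like this (the CSV file name is made up):

# keep the header, drop exact duplicate rows from the exported listing
{ head -n1 dirtree_filesonly.csv; tail -n+2 dirtree_filesonly.csv | sort -u; } > listing_dedup.csv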

The result of this final scraper is the full file and directory listing. The process is a bit of a bother: there is no real way to detect when the full list of directories has been collected, and the same folder can be listed multiple times - I never got around to detecting duplicates. In the end I figured out how to use an application access token and wrote a script that accesses the site directly. Much more stable and easier to use.
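
I won't post that whole script, but the general idea is something like the sketch below - not my actual code, just a rough outline using Microsoft Graph, assuming an access token is already in $TOKEN, jq is installed, and with the site names as placeholders (pagination via @odata.nextLink is not handled):

#!/bin/bash
# Sketch only: list a SharePoint document library through Microsoft Graph.
GRAPH="https://graph.microsoft.com/v1.0"

# Resolve the site id from its host name and path.
SITE_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$GRAPH/sites/secretinternalsite.sharepoint.com:/sites/roboticsprojectstorage" | jq -r '.id')

# Take the first document library ("drive") of the site.
DRIVE_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$GRAPH/sites/$SITE_ID/drives" | jq -r '.value[0].id')

# Recursively print the path of every file and folder.
# $1 is "root" for the top level, or "items/<item-id>" for a subfolder.
list_children() {
  curl -s -H "Authorization: Bearer $TOKEN" "$GRAPH/drives/$DRIVE_ID/$1/children" |
    jq -r '.value[] | [.id, .parentReference.path + "/" + .name, (.folder != null | tostring)] | @tsv' |
    while IFS=$'\t' read -r id path is_folder; do
      echo "$path"
      [ "$is_folder" = "true" ] && list_children "items/$id"
    done
}

list_children root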
