Obfuscated links or just my incompetence?

I'm still somewhat new to this but have managed to get through several successful jobs with Web Scraper. This one is stumping me though: I can't access the links in these Internet Archive search results using any selector configuration whatsoever, so I can't even get started with a sitemap.

The search hit URLs I want to access are readily available by hovering or clicking through, but the selector only gives me "Parent does not contain the selected element". :woozy_face:

https://archive.org/details/texts?tab=collection&query=creator%3AChorev

Hi,

The issue occurs because the content is nested under several shadow roots.

So this is an edge case where the point-and-click won't work and the selectors will have to be constructed manually by inspecting the HTML.
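To make the manual construction less error-prone, here is a minimal Python sketch that assembles such a selector from the chain of shadow hosts you read off in the browser's inspector. The `:shadow-root` suffix is the notation Web Scraper uses in the sitemaps in this thread; the helper name `shadow_selector` and the `ARCHIVE_HOSTS` list are made up for illustration, and the host chain is an assumption about how the page is currently structured.

```python
def shadow_selector(hosts, target):
    """Join shadow host tag names into a Web Scraper style selector.

    Each host gets the `:shadow-root` suffix; `target` is the ordinary
    CSS selector to match inside the innermost shadow root.
    """
    parts = [f"{host}:shadow-root" for host in hosts]
    parts.append(target)
    return " ".join(parts)


# Hypothetical constant: the host chain as seen when inspecting the
# archive.org search results page. It may change if the site is updated.
ARCHIVE_HOSTS = [
    "app-root",
    "collection-page",
    "collection-browser",
    "infinite-scroller",
    "tile-dispatcher",
]

link_selector = shadow_selector(ARCHIVE_HOSTS, "a")
print(link_selector)
```

The printed string can be pasted into the selector field of a Link selector. If the site adds or renames a host element, only the list needs to change.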

Here is a reference sitemap showing how to access data in a shadow-root:

{"_id":"archive-org","startUrl":["https://archive.org/details/texts?tab=collection&query=creator%3AChorev"],"selectors":[{"id":"link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"app-root:shadow-root collection-page:shadow-root collection-browser:shadow-root infinite-scroller:shadow-root tile-dispatcher:shadow-root a","type":"SelectorLink"},{"id":"title","multiple":false,"parentSelectors":["link"],"regex":"","selector":"span[itemprop='name']","type":"SelectorText"},{"id":"Identifier","multiple":false,"parentSelectors":["link"],"regex":"","selector":"span[itemprop='identifier']","type":"SelectorText"},{"id":"publisher","multiple":false,"parentSelectors":["link"],"regex":"","selector":"span[itemprop='publisher']","type":"SelectorText"}]}

Wow, thank you so much. It works well and I've learned something new. It's some consolation that the solution wasn't right in front of my eyes.


Not sure if I should revive this thread or start a new one? With the generous help above I've made good use of a sitemap with Internet Archive book search hits, but I have been unable to collect more than one page of search hits at a time.

It seems like a typical infinite scroll scenario with no buttons to advance pages, but my efforts to use SelectorElementScroll have failed. I assume the issue is what marker to use for the scroll down element. I don't know if the shadow-root links require a special setup or if I am just not structuring this properly - increasing the scroll delay had no effect.

{"_id":"Archive-org-Test","startUrl":["https://archive.org/details/texts?tab=collection&query=NLB&and[]=collection%3A%22internetarchivebooks%22"],"selectors":[{"id":"element-wrapper","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"app-root:shadow-root collection-page:shadow-root collection-browser:shadow-root infinite-scroller:shadow-root tile-dispatcher:shadow-root a","multiple":true,"delay":2000,"elementLimit":0},{"id":"link","parentSelectors":["element-wrapper"],"type":"SelectorLink","selector":"app-root:shadow-root collection-page:shadow-root collection-browser:shadow-root infinite-scroller:shadow-root tile-dispatcher:shadow-root a","multiple":true,"linkType":"linkFromHref"},{"id":"name","parentSelectors":["link"],"type":"SelectorText","selector":"span[itemprop='name']","multiple":false,"regex":""},{"id":"identifier","parentSelectors":["link"],"type":"SelectorText","selector":"span[itemprop='identifier']","multiple":false,"regex":""},{"id":"publisher","parentSelectors":["link"],"type":"SelectorGroup","selector":"span[itemprop='publisher']","extractAttribute":""}]}

Nah, just that this site is a bit tricky due to multiple levels of shadow DOM (shadow-root). So the selector is going to be a long one. Try the sitemap below. I used Page load delay: 4000. I also set the Element limit to 250 for testing; adjust as needed:

{"_id":"archive-org-test-b","startUrl":["https://archive.org/details/texts?tab=collection&query=NLB&and[]=collection%3A%22internetarchivebooks%22"],"selectors":[{"elementLimit":250,"id":"scroller","multiple":true,"parentSelectors":["_root"],"scroll":true,"selector":"app-root:shadow-root collection-page:shadow-root collection-browser:shadow-root infinite-scroller:shadow-root tile-dispatcher:shadow-root a","type":"SelectorElement"},{"id":"Title","multiple":false,"parentSelectors":["scroller"],"regex":"","selector":"item-tile:shadow-root h4","type":"SelectorText"},{"id":"Author","multiple":false,"parentSelectors":["scroller"],"regex":"","selector":"item-tile:shadow-root span","type":"SelectorText"},{"id":"Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["scroller"],"selector":"_parent_","type":"SelectorLink"}]}
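If you want to change the element limit without opening the selector editor, you can edit the sitemap JSON before importing it. Below is a minimal Python sketch of that idea. The function name `set_element_limit` is hypothetical, the limit of 1000 is an arbitrary example value, and the sitemap in the snippet is a cut-down copy that keeps only the `scroller` selector.

```python
import json


def set_element_limit(sitemap_json, selector_id, limit):
    """Return the sitemap JSON with `elementLimit` changed on one selector."""
    sitemap = json.loads(sitemap_json)
    for selector in sitemap["selectors"]:
        if selector["id"] == selector_id:
            selector["elementLimit"] = limit
            # Compact separators keep the one-line form used for import.
            return json.dumps(sitemap, separators=(",", ":"))
    raise KeyError(f"no selector with id {selector_id!r}")


# Cut-down copy of the scrolling sitemap: only the "scroller" selector.
SITEMAP = json.dumps({
    "_id": "archive-org-test-b",
    "startUrl": [
        "https://archive.org/details/texts?tab=collection&query=NLB"
        "&and[]=collection%3A%22internetarchivebooks%22"
    ],
    "selectors": [{
        "elementLimit": 250,
        "id": "scroller",
        "multiple": True,
        "parentSelectors": ["_root"],
        "scroll": True,
        "selector": (
            "app-root:shadow-root collection-page:shadow-root "
            "collection-browser:shadow-root infinite-scroller:shadow-root "
            "tile-dispatcher:shadow-root a"
        ),
        "type": "SelectorElement",
    }],
})

# 1000 is an arbitrary example value, not a recommendation.
updated = set_element_limit(SITEMAP, "scroller", 1000)
print(updated)
```

The output is a single line of JSON that can be pasted into the import dialog in the same way as the sitemaps in this thread.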


This is great, thank you! It works well for me in Chrome.

The sitemap imports differently for me in Firefox: the selector comes in with the plain Element type, so it doesn't scroll and pick up all the search hits. Even if I change the type to Element scroll down, it only finds the elements visible on the page. Still, I am happy to use Chrome, and I may have done something wrong in Firefox.
