Simple Scrape and How to write out selectors manually?

highoctane · April 26, 2018, 1:07am

Apologies in advance for the dumb question. This seems like it should be easy to scrape, but because the site's HTML is organized poorly, I'm having trouble.

I would like the blog names and URLs from the list of 50 on this site.
URL: https://blogging.com/top-bloggers/

I mostly select what I want manually by clicking, but sometimes, as happens here, it grabs stuff I don't want. Is there a way to deselect elements, or are there instructions on how to write out the selector? I know I want the first <a href> tag under the first <p> tag of every <h2> tag.

Sitemap:
{"_id":"blogs","startUrl":["https://blogging.com/top-bloggers/"],"selectors":[{"id":"Name","type":"SelectorText","selector":"h2","parentSelectors":["_root"],"multiple":true,"regex":"","delay":0},{"id":"url","type":"SelectorText","selector":"p:nth-of-type(n+6) a:nth-of-type(1)","parentSelectors":["_root"],"multiple":true,"regex":"","delay":0}]}

CypherConjured · April 26, 2018, 2:07am

This extension uses CSS selectors. You can learn more about them here: https://www.w3schools.com/cssref/css_selectors.asp

Edit: To answer your question more directly, yes, you can type the selector into the field manually instead of clicking Select.

Edit2: You may want to try a different tool. The HTML structure of this page makes it very difficult to scrape with this tool.
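
As a rough illustration of the "different tool" route, here is a minimal Python sketch that uses only the standard library's html.parser. It pairs each <h2> with the first <a href> inside the first <p> that follows it, which is the rule described in the question. The sample markup and the names HeadingLinkParser and extract_rows are made up for illustration; the real page's structure has not been checked and may need different handling.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Pair each <h2> with the first <a href> in the first <p> after it."""

    def __init__(self):
        super().__init__()
        self.rows = []          # collected (blog name, url) pairs
        self._in_h2 = False     # True while inside an <h2>
        self._name_parts = []   # text fragments of the current <h2>
        self._pending = None    # heading text still waiting for its link
        self._p_state = "none"  # "none" -> "open" -> "closed" per heading

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # A new heading starts; drop any heading that never found a link.
            self._in_h2 = True
            self._name_parts = []
            self._pending = None
        elif tag == "p" and self._pending is not None and self._p_state == "none":
            self._p_state = "open"
        elif tag == "a" and self._pending is not None and self._p_state == "open":
            href = dict(attrs).get("href")
            if href:
                self.rows.append((self._pending, href))
                self._pending = None

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self._in_h2 = False
            # Collapse whitespace in the heading text.
            self._pending = " ".join("".join(self._name_parts).split())
            self._p_state = "none"
        elif tag == "p" and self._p_state == "open":
            # Only the first <p> after a heading is searched.
            self._p_state = "closed"

    def handle_data(self, data):
        if self._in_h2:
            self._name_parts.append(data)


def extract_rows(html):
    """Return a list of (heading text, first link URL) tuples."""
    parser = HeadingLinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup, only loosely modeled on a "top N" list page.
SAMPLE_HTML = """
<p>Intro with <a href="https://example.com/about">a stray link</a>.</p>
<h2>1. Example Blog</h2>
<p><a href="https://example.com/one">Visit</a> and <a href="https://example.com/extra">more</a></p>
<p><a href="https://example.com/second-paragraph">ignored</a></p>
<h2>2. Another Blog</h2>
<p>Text first, then <a href="https://example.org/two">the site</a>.</p>
"""

rows = extract_rows(SAMPLE_HTML)
for name, url in rows:
    print(name, url)
```

Links before the first heading, extra links in the same paragraph, and links in later paragraphs are all skipped, so each heading yields at most one row. To try it on a real page you would fetch the HTML yourself (for example with urllib.request) and pass the text to extract_rows.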

highoctane · April 26, 2018, 2:37pm

Hi Cypher,

Thank you for replying. With this tool, if it's difficult to click the elements I want, will writing the selector manually be just as hard?

Thanks!

CypherConjured · April 26, 2018, 11:18pm

I feel like you could use regular expressions (RegEx) to get all the links. Maybe search the page for a pattern like www.*.* or something similar, and you'd get just the links that you mentioned, but you would have to match them up manually with the h2 headers. You want a tool that works well with text documents, because that is the way the page is formatted.
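
For what it's worth, the matching-up step could probably be automated too. Below is a minimal sketch using Python's re module: it finds every h2, then takes the first href that appears before the next h2. The patterns, the sample string, and the function name pair_headings_with_links are assumptions for illustration. Regular expressions are fragile on real HTML (single-quoted attributes, nested markup and so on would need extra care), and this version does not restrict the search to the first <p>.

```python
import re

# Heading text, the first double-quoted href of an <a> tag, and any tag at all.
H2_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def pair_headings_with_links(html):
    """Return (heading text, first following link or None) for each <h2>."""
    pairs = []
    headings = list(H2_RE.finditer(html))
    for i, heading in enumerate(headings):
        # The section runs from the end of this heading to the next heading.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(html)
        section = html[heading.end():end]
        link = HREF_RE.search(section)
        # Strip inline tags from the heading and collapse whitespace.
        name = " ".join(TAG_RE.sub("", heading.group(1)).split())
        pairs.append((name, link.group(1) if link else None))
    return pairs


# Hypothetical markup for demonstration only.
SAMPLE = (
    '<h2>1. Example Blog</h2>'
    '<p><a href="https://www.example.com/">Site</a></p>'
    '<h2>2. <em>Another</em> Blog</h2>'
    '<p>No link here.</p>'
)

pairs = pair_headings_with_links(SAMPLE)
for name, url in pairs:
    print(name, url)
```

Headings with no link in their section come back with None, so gaps are visible instead of shifting every later row by one.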