Scrape both text and link from table?

I'm trying to scrape a table and capture not only the text of each row but also the link that each row contains. I've used the "Table" element selector, which gets very close, but it doesn't pull the link. I tried the "Link" selector, but while that pulls the link, it doesn't put it on the same row as the rest of the scraped table data, which is what I'm after.

(Actually, if anyone can figure it out, what I'd really like to do is scrape the table data, then follow the link from each row, and then also capture the link to the .pdf displayed on the linked page, if that makes sense. E.g.: the first row on https://oag.ca.gov/privacy/databreach/list is "California Department of Public Health"; I'd like to scrape the name, the date the breach was made public, and the date the breach occurred, AND then get the actual link to the PDF included on the page that each row links to: https://oag.ca.gov/system/files/Sample%20CDPH%20Breach%20Notification%20Letter_5_23_18_0.pdf )

I don't know any CSS or HTML, but I've had some success using the element selectors. Is this something the selectors can't do? Is there something I could be doing better? Thanks! And huge thanks for this program; it's helping us do some good work.

Url: https://oag.ca.gov/privacy/databreach/list

Sitemap:
{"_id":"ca-ag-breachportal","startUrl":["https://oag.ca.gov/privacy/databreach/list"],"selectors":[{"id":"tabledata","type":"SelectorTable","selector":"table.views-table","parentSelectors":["_root"],"multiple":true,"columns":[{"header":"Organization Name","name":"Organization Name","extract":true},{"header":"Date(s) of Breach","name":"Date(s) of Breach","extract":true},{"header":"Reported Date","name":"Reported Date","extract":true}],"delay":0,"tableDataRowSelector":"tbody tr","tableHeaderRowSelector":"thead tr"},{"id":"Link","type":"SelectorLink","selector":"td.views-field a","parentSelectors":["_root"],"multiple":true,"delay":0}]}

Any help anyone has would be hugely appreciated! We're well and truly stuck on our end.

Hi!

At the moment it's not possible to get such results using just one sitemap.

It's easier to first run the scraper to get the table data, then, using a second sitemap, scrape every PDF page with link and text selectors, and finally use Excel or any other spreadsheet tool to merge the results.

Please keep in mind that results will be shown bottom-up.
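The merge step above can also be done in a few lines of Python instead of Excel. This is a minimal sketch: the column names (`Link-href`, `web-scraper-start-url`) and file contents are illustrative stand-ins for the two CSV exports, so check the headers of your actual files.

```python
# Sketch: merging two Web Scraper CSV exports with pandas instead of Excel.
# Column names ("Link-href", "web-scraper-start-url") are illustrative;
# inline CSV strings stand in for the exported files.
import io
import pandas as pd

table_csv = io.StringIO(
    "Organization Name,Reported Date,Link-href\n"
    "Acme Corp,2018-05-23,https://example.com/breach/1\n"
)
pdf_csv = io.StringIO(
    "web-scraper-start-url,link2pdf\n"
    "https://example.com/breach/1,https://example.com/files/notice.pdf\n"
)

table = pd.read_csv(table_csv)
pdf_pages = pd.read_csv(pdf_csv)

# The second scrape's start URL is the link collected in the first scrape,
# so the link URL serves as the join key.
merged = table.merge(
    pdf_pages, left_on="Link-href", right_on="web-scraper-start-url", how="left"
)
print(merged["link2pdf"].iloc[0])
```

A left join keeps every table row even if a PDF page failed to scrape, which makes the gaps easy to spot afterwards.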

I'm also really interested in being able to scrape both links and text from a table.

In my case, there are name and email columns. If I use a Table selector, it scrapes the text "email" rather than the email address. If I use a multiple Link selector, I'd have to manually match the name and email columns after scraping.

@endurancescout

When using the Table selector, you can manually select the rows you need using the browser's built-in element selection tool (Ctrl + Shift + C) and then add them to the selector.

You can also put as many items in a selector as you want; they just have to be separated by commas. The results for such a selector will be merged into one line.
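The comma trick is standard CSS selector grouping: `a, b` matches any element that matches either selector. A small sketch with BeautifulSoup (the HTML snippet and class names are made up for illustration):

```python
# Sketch of CSS selector grouping: "td.name, td.email" matches both cells,
# and their text can then be merged into one line, mirroring what the
# extension does with comma-separated selectors.
from bs4 import BeautifulSoup

html = """
<tr>
  <td class="name">Acme Corp</td>
  <td class="email"><a href="mailto:info@acme.test">info@acme.test</a></td>
  <td class="phone">555-0100</td>
</tr>
"""
row = BeautifulSoup(html, "html.parser")

# One grouped selector picks up both cells, in document order.
cells = row.select("td.name, td.email")
merged = " ".join(c.get_text(strip=True) for c in cells)
print(merged)  # Acme Corp info@acme.test
```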

This solution does not work for me. The table selection is fine, but when I try to add anything using the comma separator, I get no output for anything after the comma.
I am trying to do the exact same thing - get a table from a page and have it include the href link for the element(s) in each row that have a link. This seems like a common use case.

@privacypro

I was so wrong . . .

After a while I decided to try another approach, and succeeded!

It's a tricky one (re-written for better understanding):

You create an Element selector to group the scraped results so that each output row corresponds to a table row. This is the parent element that will contain all the child selectors for the scraped data, kept in order.
Since what we want to scrape is a table, it consists of a few basic elements: th (table header cell), tr (table row), and td (table data cell). See the W3Schools page on HTML tables for a fuller explanation.
Then you select all tr elements (table rows) for it:


Now we have all rows selected.

Inside the Element selector, you then create Text and Link selectors to build your grouped data rows.
After creating the Link selector (which lets the scraper follow each table row's link to its page), you will notice that your parent element wrapper is limited to a single row:


Since the scraper walks through the table row by row, you must pick the right selector for each piece of data, and likewise for the Text selectors that pick up the remaining cells.
If you cannot pick the correct table cell (td) by clicking, use the browser's Inspect tool.

You can now see that it's td.views-field-field-sb24-org-name that holds the company name. Going further down the tree, you'll notice it contains a link (an a element with an href=... attribute) to the detail page. So the correct selector is: td.views-field-field-sb24-org-name a
Repeat that procedure for each table cell so the final output forms a complete row.
I think you can figure out how to build the selectors that pick up the PDF href attribute on each linked page: you create selectors inside the Link selector to extract the data there.
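For readers more comfortable with code, the row-by-row structure described above can be sketched in plain Python with BeautifulSoup standing in for the extension. The class names come from the sitemap in this thread; the HTML snippet and the href value are illustrative, not copied from the live page.

```python
# Sketch of the walkthrough: an "Element selector" (tbody tr) groups each
# row, and child selectors pull the link text, the href, and the two dates
# from the same row. Class names are from the thread's sitemap; the HTML
# snippet is a made-up stand-in for the CA AG table.
from bs4 import BeautifulSoup

page = """
<table class="views-table"><tbody>
  <tr>
    <td class="views-field views-field-field-sb24-org-name">
      <a href="/ecrime/databreach/reports/sb24-123">California Department of Public Health</a>
    </td>
    <td class="views-field views-field-field-sb24-breach-date">05/21/2018</td>
    <td class="views-field views-field-created">05/23/2018</td>
  </tr>
</tbody></table>
"""
soup = BeautifulSoup(page, "html.parser")

rows = []
for tr in soup.select("table.views-table tbody tr"):    # the Element selector
    link = tr.select_one("td.views-field-field-sb24-org-name a")
    rows.append({
        "name": link.get_text(strip=True),              # text of the link
        "url": link["href"],                            # href of the same cell
        "breach_date": tr.select_one(
            "td.views-field-field-sb24-breach-date").get_text(strip=True),
        "reported": tr.select_one(
            "td.views-field-created").get_text(strip=True),
        # A real scraper would now fetch rows[-1]["url"] and read
        # select_one("span.file a")["href"] to get the PDF link,
        # mirroring the link2pdf selector nested inside the Link selector.
    })
print(rows[0]["name"], rows[0]["url"])
```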

Here's a sitemap:

{"_id":"breachportal","startUrl":["https://oag.ca.gov/privacy/databreach/list"],"selectors":[{"id":"grouped","type":"SelectorElement","selector":"tr:nth-of-type(n+1)","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"link","type":"SelectorLink","selector":"td.views-field.views-field-field-sb24-org-name a","parentSelectors":["grouped"],"multiple":true,"delay":0},{"id":"DoB","type":"SelectorText","selector":"td.views-field.views-field-field-sb24-breach-date","parentSelectors":["grouped"],"multiple":false,"regex":"","delay":0},{"id":"repdate","type":"SelectorText","selector":"td.views-field.views-field-created","parentSelectors":["grouped"],"multiple":false,"regex":"","delay":0},{"id":"link2pdf","type":"SelectorElementAttribute","selector":"span.file a","parentSelectors":["link"],"multiple":false,"extractAttribute":"href","delay":0}]}

And here are some of the results (I've stopped the scraper, as it looks like it will run through to the end).


@iconoclast Anton, thank you so much, this is incredibly helpful!! What a great little tool you've made. I'm not 100% sure I understand your explanation (maybe some screenshots or screencasts would get it across better; if you have the time, that would also be incredibly helpful), but I hope to be able to use this method to scrape more AG websites. Eventually, scraping all of these would essentially capture every publicly reported data breach, a resource that simply isn't available today (bizarrely). If you get some free time and want to test your tool's flexibility, I'm going to be working through all of these, and your expertise would be an enormous boon.
• Delaware: https://attorneygeneral.delaware.gov/fraud/cpu/securitybreachnotification/database/
• Indiana: http://www.in.gov/attorneygeneral/2874.htm
• Iowa: https://www.iowaattorneygeneral.gov/for-consumers/security-breach-notifications/2018-security-breach-notifications/
• Maine: http://www.maine.gov/ag/consumer/identity_theft/
• Maryland: http://www.marylandattorneygeneral.gov/Pages/IdentityTheft/breachnotices.aspx
• Massachusetts: https://www.mass.gov/service-details/data-breach-notification-reports
• Montana: https://dojmt.gov/consumer/consumers-known-data-breach-incidents/
• New Hampshire: https://www.doj.nh.gov/consumer/security-breaches/
• New Jersey: https://www.cyber.nj.gov/data-breach-notifications/
• Oregon: https://justice.oregon.gov/consumer/DataBreach/Home/
• Vermont: http://ago.vermont.gov/archived-security-breaches/
• Washington: http://www.atg.wa.gov/data-breach-notifications
• Wisconsin: https://datcp.wi.gov/Pages/Programs_Services/DataBreaches.aspx

I've rewritten the procedure :slight_smile:


Try this: I think this gets what you want.

{"_id":"databreach","startUrl":["https://oag.ca.gov/privacy/databreach/list"],"selectors":[{"id":"Row","type":"SelectorElement","selector":"tbody tr","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"name","type":"SelectorText","selector":"td.views-field a","parentSelectors":["Row"],"multiple":false,"regex":"","delay":0},{"id":"Date of Breach","type":"SelectorText","selector":"span.date-display-single:nth-of-type(1)","parentSelectors":["Row"],"multiple":false,"regex":"","delay":0},{"id":"reported date","type":"SelectorText","selector":"td.views-field.views-field-created","parentSelectors":["Row"],"multiple":false,"regex":"","delay":0},{"id":"link","type":"SelectorLink","selector":"td.views-field a","parentSelectors":["Row"],"multiple":false,"delay":0},{"id":"PDF","type":"SelectorElementAttribute","selector":"span.file a","parentSelectors":["link"],"multiple":false,"extractAttribute":"href","delay":0}]}

Each site needs a different sitemap. It's going to take a bit of work, but they don't seem overly complex.

@bretfeig @iconoclast

So I'm now trying to follow your walkthrough, iconoclast, for this site: https://attorneygeneral.delaware.gov/fraud/cpu/securitybreachnotification/database/

This is as far as I've gotten:
{"_id":"databreachimport_delaware","startUrl":["https://attorneygeneral.delaware.gov/fraud/cpu/securitybreachnotification/database/"],"selectors":[{"id":"TableClickThroughPages","type":"SelectorElementClick","selector":"div.col-sm-12","parentSelectors":["_root"],"multiple":true,"delay":0,"clickElementSelector":"li.paginate_button.next a","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"Name","type":"SelectorText","selector":"td.sorting_1","parentSelectors":["TableClickThroughPages"],"multiple":true,"regex":"","delay":0},{"id":"DateOfBreach","type":"SelectorText","selector":"td:nth-of-type(2)","parentSelectors":["TableClickThroughPages"],"multiple":true,"regex":"","delay":0},{"id":"ReportedDate","type":"SelectorText","selector":"td:nth-of-type(3)","parentSelectors":["TableClickThroughPages"],"multiple":true,"regex":"","delay":0},{"id":"NumberOfDelawareResidentsAffected","type":"SelectorText","selector":"td:nth-of-type(4)","parentSelectors":["TableClickThroughPages"],"multiple":true,"regex":"","delay":0},{"id":"Link","type":"SelectorLink","selector":"td a.icon","parentSelectors":["TableClickThroughPages"],"multiple":true,"delay":0}]}

When I try to inspect an element, I can't figure out what the selector should be, because as far as I can see, the rows don't have unique class names like the ones on the California AG site you handled earlier. I'm sure I'm doing something stupid; anyone have any idea what it is?
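When cells carry no distinguishing classes (common on DataTables-style pages like the Delaware one), positional `td:nth-of-type(n)` selectors work instead, which is exactly what the sitemap above already does. A small sketch of that pattern, with made-up markup (not copied from the Delaware page):

```python
# Sketch: extracting cells by position with td:nth-of-type(n) when the
# cells have no unique class names. The row markup below is illustrative.
from bs4 import BeautifulSoup

row_html = """
<tr>
  <td class="sorting_1"><a class="icon" href="/notices/42.pdf">Acme Corp</a></td>
  <td>01/02/2018</td>
  <td>01/15/2018</td>
  <td>350</td>
</tr>
"""
tr = BeautifulSoup(row_html, "html.parser")

# Each cell is addressed by its position within the row, not by a class.
record = {
    "name": tr.select_one("td:nth-of-type(1)").get_text(strip=True),
    "breach_date": tr.select_one("td:nth-of-type(2)").get_text(strip=True),
    "reported": tr.select_one("td:nth-of-type(3)").get_text(strip=True),
    "residents": tr.select_one("td:nth-of-type(4)").get_text(strip=True),
    "link": tr.select_one("td a.icon")["href"],
}
print(record["breach_date"], record["link"])
```

The trade-off is fragility: positional selectors break silently if the site ever reorders its columns, so class-based selectors are preferable whenever they exist.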

I'm scraping a table fine, but how do I get it to scrape all the following pages with pagination, table only?

Now that I understand this, here is the data and how I set up the sitemap:

{"_id":"reftest","startUrl":["http://www.referenceusa.com/UsBusiness/Result/479b3993760f4821bd2d48b5eb68c4af"],"selectors":[{"id":"PAGE","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.pageArea","multiple":true,"delay":"10000","clickElementSelector":"div.menuPagerBar:nth-of-type(1) div.next","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"TABLE","type":"SelectorTable","parentSelectors":["_root","PAGE"],"selector":"table","multiple":true,"columns":[{"header":"Check Box","name":"Check Box","extract":true},{"header":"Company Name","name":"Company Name","extract":true},{"header":"Executive Name","name":"Executive Name","extract":true},{"header":"Street Address","name":"Street Address","extract":true},{"header":"City, State","name":"City, State","extract":true},{"header":"ZIP","name":"ZIP","extract":true},{"header":"Phone","name":"Phone","extract":true},{"header":"Corp. Tree\n\t\t\t\t\n\t\t\t\t \n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tRecord Type\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tTitle\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tEmployees\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tSales","name":"Corp. Tree\n\t\t\t\t\n\t\t\t\t \n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tRecord Type\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tTitle\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tEmployees\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\tSales","extract":false}],"delay":"10000","tableDataRowSelector":"tbody#searchResultsPage tr","tableHeaderRowSelector":"thead tr"}]}