Extract Infoboxes from Wikipedia

tomasic · September 17, 2018, 4:11pm

I'm interested in extracting a relational table from a Wikipedia infobox. For example, on the right side of the page https://en.wikipedia.org/wiki/Star_Wars:_Episode_IX there is an infobox that, converted to CSV, would look like this:

key,value
Directed By,J. J. Abrams
Produced By,Kathleen Kennedy
Produced By,J. J. Abrams
Produced By,Michelle Rejwan

etc.

Now, AFAIK Web Scraper would have trouble generating the duplicate "Produced By" rows (in the HTML, "Produced By" is a row header). So any other result that could be easily converted into the above would be great. The page is extraordinarily well labeled, but getting the right "nesting" or matching elements into the selector is tricky.

iconoclast · September 17, 2018, 7:51pm

Hey there!

The end result always depends on the particular page you're trying to scrape.
In your case it can be done in a few ways. One is to select the list of authors with an Element selector and then split it into separate columns by child number.

Please import and analyze this sitemap:
{"_id":"wikipedia","startUrl":["https://en.wikipedia.org/wiki/Star_Wars:_Episode_IX"],"selectors":[{"id":"Produced by","type":"SelectorText","selector":"tr:contains('Produced by') th","parentSelectors":["_root"],"multiple":false,"regex":"","delay":0},{"id":"Authors","type":"SelectorElement","selector":"tr:nth-of-type(4) div.plainlist","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"author1","type":"SelectorText","selector":"li:nth-child(1)","parentSelectors":["Authors"],"multiple":false,"regex":"","delay":0},{"id":"author2","type":"SelectorText","selector":"li:nth-child(2)","parentSelectors":["Authors"],"multiple":false,"regex":"","delay":0},{"id":"author3","type":"SelectorText","selector":"li:nth-child(3)","parentSelectors":["Authors"],"multiple":false,"regex":"","delay":0}]}