Organizing French Dictionary Scrape

rembrandt · May 13, 2022, 11:26am

I am trying to scrape a French dictionary site to make flashcards, but I cannot, for the life of me, figure out how to format the data properly. Getting the format right would make my life a whole lot easier.

Here's what I mean:

Current Progress:

Title | Phonetic | POS | Label | Quote Span | Form | Quote
Title | Phonetic | POS | Label | Quote Span | Form | Another Quote
Title | Phonetic | POS | Label | Quote Span | Another Form | Quote
Title | Phonetic | POS | Label | Quote Span | Another Form | Another Quote

Aim:

Title | Phonetic | POS | Label | Quote Span | Form | Quote | POS | Label | Quote Span | Form | Quote etc. etc., with all of the data for one word on one line.
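
One way to reach this layout outside the scraper is to post-process the exported CSV. Below is a minimal Python sketch, not a definitive solution. It assumes the export has columns named after the selector ids (title, phonetic, pos, label, quote span, form, quote). The function name rows_to_single_line and the sample rows are made up for illustration and simply mirror the placeholders above.

```python
import csv
import io

# Assumed column names, taken from the selector ids in the sitemap.
WORD_COLUMNS = ["title", "phonetic"]
REPEATED_COLUMNS = ["pos", "label", "quote span", "form", "quote"]


def rows_to_single_line(csv_text):
    """Collapse all rows that share a title/phonetic into one list of cells."""
    words = {}  # dicts keep insertion order, so words stay in scrape order
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row.get(col) or "" for col in WORD_COLUMNS)
        # Start each word's line with title and phonetic, written once.
        line = words.setdefault(key, list(key))
        # Append the repeated block of columns for every scraped row.
        line.extend(row.get(col) or "" for col in REPEATED_COLUMNS)
    return list(words.values())


# Placeholder data only, not real scraper output.
sample = """title,phonetic,pos,label,quote span,form,quote
Title,Phonetic,POS,Label,Quote Span,Form,Quote
Title,Phonetic,POS,Label,Quote Span,Form,Another Quote
Title,Phonetic,POS,Label,Quote Span,Another Form,Quote
"""

for line in rows_to_single_line(sample):
    print(" | ".join(line))
```

Each word ends up as one list of cells, which can be written back out with csv.writer. Note that the lines will have different lengths depending on how many rows a word had.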

Here's an explanation of the selectors:

URL: https://www.collinsdictionary.com/dictionary/french-english/partir (just an example)

Sitemap:
{"_id":"frenchcollinsdictionarybase","startUrl":["https://pastelink.net/z8mt2ijb"],"selectors":[{"id":"dictionary","parentSelectors":["links"],"type":"SelectorElement","selector":"div.dc","multiple":true,"delay":0},{"id":"title","parentSelectors":["dictionary"],"type":"SelectorText","selector":".h2_entry span","multiple":false,"delay":0,"regex":""},{"id":"phonetic","parentSelectors":["dictionary"],"type":"SelectorText","selector":".form span.pron","multiple":false,"delay":0,"regex":""},{"id":"hom","parentSelectors":["dictionary"],"type":"SelectorElement","selector":"div.hom","multiple":true,"delay":0},{"id":"pos","parentSelectors":["hom"],"type":"SelectorText","selector":"span.pos","multiple":false,"delay":0,"regex":""},{"id":"sense","parentSelectors":["hom"],"type":"SelectorElement","selector":"div.sense","multiple":true,"delay":0},{"id":"label","parentSelectors":["sense"],"type":"SelectorText","selector":"> span.gramGrp, span.lbl","multiple":false,"delay":0,"regex":""},{"id":"quote span","parentSelectors":["sense"],"type":"SelectorText","selector":"> span span.quote","multiple":false,"delay":0,"regex":""},{"id":"divre","parentSelectors":["sense"],"type":"SelectorElement","selector":"div.re","multiple":true,"delay":0},{"id":"form","parentSelectors":["divre"],"type":"SelectorText","selector":"span.form","multiple":false,"delay":0,"regex":""},{"id":"quote","parentSelectors":["divre"],"type":"SelectorText","selector":"span.quote","multiple":false,"delay":0,"regex":""},{"id":"links","parentSelectors":["_root"],"type":"SelectorLink","selector":".body-display a","multiple":true,"delay":0}]}

rembrandt · May 16, 2022, 9:12pm

Hey there, just checking again if anyone knows how to reorganize data this way in any other program?

ViestursWS · May 17, 2022, 2:18pm

@rembrandt Hi, if you are looking to extract all of the available titles, labels, quotes, etc. in a single line, you should be using the 'Grouped' selector instead. You can also create multiple selector variants by separating them with a comma.

Learn more: Grouped selector | Web Scraper Documentation

rembrandt · May 19, 2022, 6:59pm

Hey there, I am really sorry for all of the trouble, but I don't think I quite understand.

Whenever I use the grouped selector, for instance on the senses, I don't receive any word breaks.
When I use commas in the selector, then the data output does not come in rows.

Again, I am very sorry for all of this trouble and for the late reply. I appreciate your magnanimous efforts.

ViestursWS · May 20, 2022, 8:16am

@rembrandt Can you send a practical example of what the final data should be like for any of these words?

rembrandt · May 20, 2022, 12:19pm

This is what the data should look like for the entry "partir":

etc. etc. until reaching the next sense, at which point the sequence repeats
etc. etc. until reaching the next part of speech (pos), at which point the sequence repeats

And so everything should be on one line.
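
The repeating pattern described here (forms and quotes until the next sense, senses until the next part of speech) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the rows are dicts keyed by the selector ids, they all belong to one word, and they arrive in scrape order. The names flatten_entry and make_row and the sample values are hypothetical.

```python
def flatten_entry(rows):
    """Flatten the rows of ONE word into a single list of cells.

    The pos is written only when it changes, label and quote span only
    when the sense changes, and form and quote for every row.
    """
    if not rows:
        return []
    line = [rows[0]["title"], rows[0]["phonetic"]]
    prev_pos = prev_sense = None
    for row in rows:
        pos = row["pos"]
        # Assumption: a sense is identified by pos + label + quote span.
        # Two adjacent senses with identical values would be merged.
        sense = (pos, row["label"], row["quote span"])
        if pos != prev_pos:
            line.append(pos)
        if sense != prev_sense:
            line.extend([row["label"], row["quote span"]])
        line.extend([row["form"], row["quote"]])
        prev_pos, prev_sense = pos, sense
    return line


def make_row(pos, label, span, form, quote):
    """Build one placeholder row (hypothetical helper for the example)."""
    return {"title": "Title", "phonetic": "Phonetic", "pos": pos,
            "label": label, "quote span": span, "form": form, "quote": quote}


# Placeholder rows only, not real dictionary data.
rows = [
    make_row("POS 1", "Label 1", "Quote Span 1", "Form 1", "Quote 1"),
    make_row("POS 1", "Label 1", "Quote Span 1", "Form 2", "Quote 2"),
    make_row("POS 1", "Label 2", "Quote Span 2", "Form 3", "Quote 3"),
    make_row("POS 2", "Label 3", "Quote Span 3", "Form 4", "Quote 4"),
]

print(" | ".join(flatten_entry(rows)))
```

Identifying a sense by its values is a shortcut. If the export includes a stable id for each sense, comparing on that id would be more reliable.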

If I explained it badly, please tell me where it was confusing. Otherwise, thank you for replying so quickly!

ViestursWS · May 23, 2022, 2:48pm

@rembrandt If you are not looking to separate each of the 'divre' elements, you can use the 'Grouped' selector instead.

{"_id":"frenchcollinsdictionarybase","startUrl":["https://pastelink.net/z8mt2ijb"],"selectors":[{"delay":0,"id":"dictionary","multiple":true,"parentSelectors":["links"],"selector":"div.dc","type":"SelectorElement"},{"delay":0,"id":"title","multiple":false,"parentSelectors":["dictionary"],"regex":"","selector":".h2_entry span","type":"SelectorText"},{"delay":0,"id":"phonetic","multiple":false,"parentSelectors":["dictionary"],"regex":"","selector":".form span.pron","type":"SelectorText"},{"delay":0,"id":"hom","multiple":true,"parentSelectors":["dictionary"],"selector":"div.hom","type":"SelectorElement"},{"delay":0,"id":"pos","multiple":false,"parentSelectors":["hom"],"regex":"","selector":"span.pos","type":"SelectorText"},{"delay":0,"id":"sense","multiple":true,"parentSelectors":["hom"],"selector":"div.sense","type":"SelectorElement"},{"delay":0,"id":"label","multiple":false,"parentSelectors":["sense"],"regex":"","selector":"> span.gramGrp, span.lbl","type":"SelectorText"},{"delay":0,"id":"quote span","multiple":false,"parentSelectors":["sense"],"regex":"","selector":"> span span.quote","type":"SelectorText"},{"delay":0,"extractAttribute":"","id":"divre","parentSelectors":["sense"],"selector":"div[class=\"cit type-example\"], div[class=\"re type-phr\"]","type":"SelectorGroup"},{"delay":0,"id":"links","multiple":true,"parentSelectors":["_root"],"selector":".body-display a","type":"SelectorLink"}]}

rembrandt · May 26, 2022, 7:27pm

@ViestursWS Yes, I am looking to join the 'divre' elements, but I also want all of the other elements to fit on one line. This includes placing the 'pos' values next to each other rather than stacking them in separate table rows. Basically, for one word, all of the data should be on one line. I am not sure this is possible, but thank you for trying.