Multiple languages in one row

I am trying to extract documents that are published in several languages. The site structure is like this:

  • start url —> year —> pagination —> title —> document in different languages.

What I get in the CSV file is:
Row 1: start url, year, page, title 1, language 1, document 1 in language 1.
Row 2: start url, year, page, title 1, language 2, document 1 in language 2.
Row 3: start url, year, page, title 1, language 3, document 1 in language 3.
Row 4: start url, year, page, title 2, language 1, document 2 in language 1.
Row 5: start url, year, page, title 2, language 2, document 2 in language 2.
...

But I need to get the data like this:
Row 1: start url, year, page, title 1, language 1, document 1 in language 1, language 2, document 1 in language 2, language 3, document 1 in language 3...
Row 2: start url, year, page, title 2, language 1, document 2 in language 1, language 2, document 2 in language 2, language 3, document 2 in language 3...

Start Url: http://w2.vatican.va/content/benedict-xvi/es/homilies.index.html <— this is where I am trying to extract the data.

Hi!

You just need properly structured link selectors to do the job.

The results shown below were obtained using a CouchDB instance on my machine to keep everything in strict order:

Here's an example sitemap I've built; take a look at it (I've used delays of 500 ms for the selectors):

{"_id":"vatican","startUrl":["http://w2.vatican.va/content/benedict-xvi/es/homilies/2005.index.html"],"selectors":[{"id":"YEARS","type":"SelectorLink","selector":"li.has-sub.open li a","parentSelectors":["_root"],"multiple":true,"delay":"500"},{"id":"DOCUMENTS","type":"SelectorLink","selector":"h1 a","parentSelectors":["YEARS"],"multiple":true,"delay":"500"},{"id":"LANGUAGES","type":"SelectorLink","selector":"span.translation a","parentSelectors":["DOCUMENTS"],"multiple":true,"delay":"500"},{"id":"example title","type":"SelectorText","selector":"div.text > p:nth-of-type(1)","parentSelectors":["LANGUAGES"],"multiple":false,"regex":"","delay":0}]}

Thanks for your answer but I think I didn't explain myself correctly.

I imported the sitemap you sent me and got the same result as before.

It is just a problem of rows.

The website has documents in different languages and I am getting one row for each document and each language. But I want to get one row for each document and all the languages within the same row.

Example:

ROW1: YEAR->TITLE1->LANGUAGE1->CONTENT1->LANGUAGE2->CONTENT2->LANGUAGE3->CONTENT3->LANGUAGE4->CONTENT4...

ROW2: YEAR->TITLE2->LANGUAGE1->CONTENT1->LANGUAGE2->CONTENT2->LANGUAGE3->CONTENT3->LANGUAGE4->CONTENT4...

I have no idea how to do that.
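One option that does not depend on the sitemap at all is to reshape the exported CSV afterwards. The following is only a minimal sketch: the column names (`year`, `title`, `language`, `document`) are assumptions for illustration, since the real export uses whatever selector ids the sitemap defines, and `pivot_languages` and `to_csv` are hypothetical helper names.

```python
import csv
import io


def pivot_languages(rows, key_fields, lang_field, doc_field):
    """Collapse one-row-per-language records into one row per document.

    Rows that share the same values in key_fields are merged; each language
    gets its own column named "document_<language>".
    """
    merged = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        wide = merged.setdefault(key, {f: row[f] for f in key_fields})
        wide[f"document_{row[lang_field]}"] = row[doc_field]
    return list(merged.values())


def to_csv(wide_rows):
    """Serialize the merged rows; missing languages become empty cells."""
    fieldnames = []
    for row in wide_rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(wide_rows)
    return buffer.getvalue()


# Assumed sample data in the "one row per language" shape.
sample = """year,title,language,document
2005,title 1,ES,texto 1
2005,title 1,EN,text 1
2005,title 2,ES,texto 2
"""
long_rows = list(csv.DictReader(io.StringIO(sample)))
wide_rows = pivot_languages(long_rows, ["year", "title"], "language", "document")
print(to_csv(wide_rows))
```

Grouping on the title alone would merge unrelated documents that happen to share a title, so the sketch keys on every column that identifies a document.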

You have to create a separate link selector and text selector for each language in order to get every language in one row. It would look something like this:

{"_id":"vatican-va-test-4","startUrl":["http://w2.vatican.va/content/benedict-xvi/pt/homilies/2005.index.html"],"selectors":[{"id":"language-pt","type":"SelectorLink","parentSelectors":["_root"],"selector":"li:nth-of-type(1) a:contains('Português')","multiple":false,"delay":0},{"id":"language-de","type":"SelectorLink","parentSelectors":["language-pt"],"selector":"span.translation a:contains('DE')","multiple":false,"delay":0},{"id":"content-de","type":"SelectorText","parentSelectors":["language-de"],"selector":"div.text.container","multiple":false,"regex":"","delay":0},{"id":"language-en","type":"SelectorLink","parentSelectors":["language-de"],"selector":"span.translation a:contains('EN')","multiple":false,"delay":0},{"id":"content-en","type":"SelectorText","parentSelectors":["language-en"],"selector":"div.text.container","multiple":false,"regex":"","delay":0},{"id":"language-es","type":"SelectorLink","parentSelectors":["language-en"],"selector":"span.translation a:contains('ES')","multiple":false,"delay":0},{"id":"content-es","type":"SelectorText","parentSelectors":["language-es"],"selector":"div.text.container","multiple":false,"regex":"","delay":0},{"id":"language-fr","type":"SelectorLink","parentSelectors":["language-es"],"selector":"span.translation a:contains('FR')","multiple":false,"delay":0},{"id":"content-fr","type":"SelectorText","parentSelectors":["language-fr"],"selector":"div.text.container","multiple":false,"regex":"","delay":0},{"id":"language-it","type":"SelectorLink","parentSelectors":["language-fr"],"selector":"span.translation a:contains('IT')","multiple":false,"delay":0},{"id":"content-it","type":"SelectorText","parentSelectors":["language-it"],"selector":"div.text.container","multiple":false,"regex":"","delay":0},{"id":"content-pt","type":"SelectorText","parentSelectors":["language-pt"],"selector":"div.text.container","multiple":false,"regex":"","delay":0}]}

Note that this sitemap assumes Portuguese is the default language, so the start URL has to point to the Portuguese version of the site.
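Writing a chained sitemap like this by hand gets tedious as the number of languages grows, and quoting mistakes are easy to make. As a rough sketch only, the same structure could be generated with a short Python script and pasted into the import dialog. The function name `build_chained_sitemap` and its parameters are hypothetical; the field names and selectors are copied from the sitemap above.

```python
import json


def link_selector(selector_id, parent_id, css):
    """A single-match link selector, with the same fields as the sitemap above."""
    return {
        "id": selector_id,
        "type": "SelectorLink",
        "parentSelectors": [parent_id],
        "selector": css,
        "multiple": False,
        "delay": 0,
    }


def text_selector(selector_id, parent_id, css="div.text.container"):
    """A single-match text selector that grabs the document body."""
    return {
        "id": selector_id,
        "type": "SelectorText",
        "parentSelectors": [parent_id],
        "selector": css,
        "multiple": False,
        "regex": "",
        "delay": 0,
    }


def build_chained_sitemap(sitemap_id, start_url, first_code, first_css, other_codes):
    """Chain one link selector per language so all contents land in one row."""
    first_id = f"language-{first_code.lower()}"
    selectors = [
        link_selector(first_id, "_root", first_css),
        text_selector(f"content-{first_code.lower()}", first_id),
    ]
    parent_id = first_id
    for code in other_codes:
        link_id = f"language-{code.lower()}"
        # Each language link is a child of the previous language's page.
        css = f"span.translation a:contains('{code}')"
        selectors.append(link_selector(link_id, parent_id, css))
        selectors.append(text_selector(f"content-{code.lower()}", link_id))
        parent_id = link_id
    return {"_id": sitemap_id, "startUrl": [start_url], "selectors": selectors}


sitemap = build_chained_sitemap(
    "vatican-va-test-4",
    "http://w2.vatican.va/content/benedict-xvi/pt/homilies/2005.index.html",
    "PT",
    "li:nth-of-type(1) a:contains('Português')",
    ["DE", "EN", "ES", "FR", "IT"],
)
# json.dumps takes care of escaping, so the output is always valid JSON.
sitemap_json = json.dumps(sitemap, ensure_ascii=False)
print(sitemap_json)
```

The selector order differs slightly from the hand-written sitemap (`content-pt` comes second rather than last), which should not matter because the hierarchy is defined by `parentSelectors`, not by position.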


Thank you so much... that is what I was looking for.

Selectors like contains("sometext") are interesting... is there a place where I can learn all of them?

Thanks again

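You can refer to the CSS selectors reference at W3Schools: https://www.w3schools.com/cssref/css_selectors.asp

Note that :contains() is a jQuery extension rather than standard CSS, so you will find it in the jQuery selector documentation instead.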