[Corpora-List] languages to encodings associations . . .

Tech Monk tekmonk2005 at yahoo.com
Tue Sep 24 16:01:25 CEST 2019


 I have found lists based on ISO 639, such as "Codes for the representation of names of languages" (Library of Congress), which contains the ISO 639-2 Alpha-3 codes,


 but no list gives you a language-to-encodings association. For example, even though it is helpful towards my goal, this one:

 https://docs.python.org/2/library/codecs.html

 doesn't really give you a language-to-encodings association, and languages such as the second and fourth most spoken by number of native speakers (Spanish and Hindi) are not listed.

 UTF-8 can be used to encode any language, but that is not true of every other encoding.

 Basically, what I have in mind is some data looking like:

 ISO-639-3|ISO-639-2|ISO-639-1|Name of language|Name of language in Java "\uffff" Unicode escape format|all encodings that can be used with that language

 For example, these would be the first five fields for three languages:

|tur|tur|tr|Türkçe|\u0054\u00fc\u0072\u006b\u00e7\u0065|
|rus|rus|ru|Русский|\u0420\u0443\u0441\u0441\u043a\u0438\u0439|
|spa|spa|es|Español|\u0045\u0073\u0070\u0061\u00f1\u006f\u006c|
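
 To make the fifth field concrete, here is a minimal Python 3 sketch of how such rows could be generated. The names java_escape and make_row are hypothetical helpers made up for illustration, not part of any existing library, and the ISO codes are simply passed in by hand rather than looked up.

```python
def java_escape(text):
    """Return text as Java-style \\uXXXX escapes, one per UTF-16 code unit."""
    # Java strings are sequences of UTF-16 code units, so encode to
    # big-endian UTF-16 and read the bytes back two at a time.
    units = text.encode("utf-16-be")
    return "".join(
        "\\u{:04x}".format((units[i] << 8) | units[i + 1])
        for i in range(0, len(units), 2)
    )


def make_row(iso3, iso2, iso1, name):
    """Build the first five pipe-delimited fields of one row (hypothetical layout)."""
    return "|" + "|".join([iso3, iso2, iso1, name, java_escape(name)]) + "|"


if __name__ == "__main__":
    print(make_row("rus", "rus", "ru", "Русский"))
```

 Characters outside the Basic Multilingual Plane would come out as two escapes (a surrogate pair), which is also how Java source represents them.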

 and after that, all the specific encodings that can be used for those languages.
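
 In the absence of a ready-made list, one rough way to approximate the last field is to test which codecs can encode a sample of the language's characters. The sketch below does that in Python 3 with a small, hand-picked set of codec names from the standard library. The CANDIDATES list, the sample strings and the function name encodings_for are assumptions chosen for illustration. Being able to encode a sample only shows that an encoding could be used, not that it is actually used in practice for that language.

```python
# Hand-picked subset of Python's standard codecs (an assumption; extend as needed).
CANDIDATES = [
    "ascii", "latin_1", "iso8859_5", "iso8859_9",
    "cp1251", "cp1252", "cp1254", "koi8_r", "utf_8",
]


def encodings_for(sample, candidates=CANDIDATES):
    """Return the candidate codecs that can encode every character in sample."""
    usable = []
    for name in candidates:
        try:
            sample.encode(name)
        except UnicodeEncodeError:
            # At least one character has no representation in this codec.
            continue
        usable.append(name)
    return usable


if __name__ == "__main__":
    # Samples should cover the language's distinctive letters, not just its name.
    samples = {
        "tur": "çğıİöşü",
        "rus": "Русский",
    }
    for code, sample in samples.items():
        print(code, encodings_for(sample))
```

 The choice of sample matters: "Türkçe" alone fits in latin_1, but the Turkish letters ğ, ı, İ and ş do not, so a sample that covers the whole alphabet gives a more honest answer.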

 I thought such a list should be easy to find out there.

 Any lists or documentation you would suggest?

 lbrtchx


