The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus has also been pre-packaged as bi-texts for each language pair (between 16 and 25 million sentences per language pair).
A subset of the corpus (ca. 11.5 million sentences) is available as a six-language fully-parallel corpus, i.e. every sentence has an equivalent in all six languages. Data from 2015 has been used to create official development and test sets, which are also fully aligned across the six official UN languages. The paper reports SMT baselines for all language pairs for this corpus.
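To illustrate what "fully parallel" means in practice, here is a minimal Python sketch. It assumes the subset is stored as one line-aligned plain-text stream per language, so that line i in every stream is the same sentence; that layout is an assumption made for illustration, not a description of the actual download format. The function name `iter_aligned` and the demo sentences are hypothetical.

```python
from itertools import zip_longest


def iter_aligned(streams):
    """Yield one {lang: sentence} dict per aligned line.

    `streams` maps a language code to any iterable of lines,
    e.g. an open text file or a list of strings (assumed layout:
    one sentence per line, same line count in every stream).
    """
    langs = list(streams)
    for row in zip_longest(*(streams[lang] for lang in langs)):
        # zip_longest pads shorter streams with None, which lets us
        # detect streams that are not fully parallel.
        if any(line is None for line in row):
            raise ValueError("streams differ in length; not fully parallel")
        yield {lang: line.rstrip("\n") for lang, line in zip(langs, row)}


# Tiny in-memory demonstration with made-up sentences (two languages only).
demo = {
    "en": ["Hello.\n", "Thank you.\n"],
    "fr": ["Bonjour.\n", "Merci.\n"],
}
pairs = list(iter_aligned(demo))
```

With real data one would pass six open file handles, one per official UN language, instead of the in-memory lists. Raising an error on a length mismatch is a deliberate choice here: silently truncating to the shortest stream would hide a broken alignment.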
The corpus can be downloaded from:
http://conferences.unite.un.org/UNCorpus
The corresponding paper was published at LREC 2016:
http://www.lrec-conf.org/proceedings/lrec2016/pdf/1195_Paper.pdf
When registering, please leave a short description of the work for which you plan to use the corpus. In the near future we plan to set up a section with references to papers that describe research done with the UN corpus. Feel free to share links and bibliography entries with us (either with me or with any of the authors of the above paper).
Marcin Junczys-Dowmunt