[Corpora-List] New parallel corpus release: OpenSubtitles2016

Jörg Tiedemann Jorg.Tiedemann at lingfil.uu.se
Tue Mar 15 19:46:50 CET 2016


We just released a major update of the parallel subtitle corpus in OPUS: http://opus.lingfil.uu.se/OpenSubtitles2016.php

The release contains 2.8 million subtitle files in 60 languages, with a total of over 17 billion tokens in 2.6 billion sentences and sentence fragments. As usual in OPUS, all languages are sentence-aligned, creating a total of 1,689 bitexts. The data sets are provided in standalone XML with standoff sentence alignment, in TMX, and in aligned plain text (a format often used for training SMT models).
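For readers who want to try the aligned plain text files, here is a minimal Python sketch of reading one bitext. It assumes the usual convention for SMT training data, which this announcement does not spell out: two UTF-8 files, one per language, with one sentence per line, where line N of one file is aligned with line N of the other. The function name read_bitext and the file names in the usage note are made up for illustration and are not part of OPUS.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_bitext(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8 and line N in one file corresponds
    to line N in the other. A length mismatch raises ValueError instead
    of silently dropping the unmatched lines.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            # Strip only the line terminator; keep all other whitespace.
            yield s.rstrip("\n"), t.rstrip("\n")
```

With a hypothetical pair of files sample.en and sample.fr, `for en, fr in read_bitext("sample.en", "sample.fr"): print(en, fr, sep="\t")` would print one aligned pair per line. Reading lazily matters here because the full corpus is far too large to load into memory at once.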

More information is available in: Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

We also provide intra-lingual alignments between alternative subtitles in the same language: http://opus.lingfil.uu.se/OpenSubtitles2016alt.php

More information about those alignments and how they are sorted into various categories can be found in: Jörg Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). http://stp.lingfil.uu.se/~joerg/paper/lrec2016.pdf

Note that all data sets are created automatically using various pre-processing and alignment tools, so there will be problems at various levels. Feedback is very welcome!

Other new data sets in OPUS:

News Commentary version 11 (originally provided by CASMACAT): http://opus.lingfil.uu.se/News-Commentary11.php Unlike the original source, this release is truly multilingual, with alignments across all languages.

Global Voices (also provided by CASMACAT): http://opus.lingfil.uu.se/GlobalVoices.php Again, this version is multilingual.

Wikipedia: A corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek. More information: Krzysztof Wołk and Krzysztof Marasek, 2014, Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs. Procedia Technology, 18, Elsevier, pp. 126-132. http://www.sciencedirect.com/science/article/pii/S2212017314005453

For more information on OPUS: http://opus.lingfil.uu.se/index.php Select the language pair you are interested in to see all resources that are available for that particular language pair. Data formats are explained here: http://opus.lingfil.uu.se/trac


**********************************************************************************
Jörg Tiedemann
Department of Modern Languages, University of Helsinki
http://www.helsinki.fi/~tiedeman/
