[Corpora-List] Seeking bilingual corpora, colloquial register, gaming

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Mon Aug 6 15:11:23 CEST 2012

Maybe translated movie subtitles would fit your needs: http://opus.lingfil.uu.se/OpenSubtitles_v2.php There are plenty of dialogues, swear words, abbreviations and even spelling mistakes (though these mostly come from OCR) in the data collection.


On Mon, Aug 6, 2012 at 11:58 AM, Alex Juan <alhelsal at posgrado.upv.es> wrote:
> Dear all,
> I am looking for bilingual/multilingual corpora that could be classified as
> UGC, that is, user-generated content. This ranges from (but may not be
> limited to) chat conversations, support forum conversations, phone/sms/email
> transcripts, etc.
> As you know, the language here is not always "standard", and this content
> may be rich not only in abbreviations but also contain spelling mistakes,
> and even figurative language and swearwords. If there are also collections
> or repositories of keywords (aka "seed" words) used in similar studies, that
> would also be of help. In the first instance, the languages of interest are
> German and English, with the items of the corpora or repositories aligned
> with one another.
> I am attempting to build an MT prototype of DE<>EN for the gaming domain.
> Does anyone know of such a corpus? Any information/orientation will be
> appreciated (even if it comes from specialists from other HLT fields, such
> as sentiment analysis or semantic web).
> Thanks.
> --
> Alex Juan

-- **********************************************************************************

Jörg Tiedemann jorg.tiedemann at lingfil.uu.se
Dep. of Linguistics and Philology http://stp.lingfil.uu.se/~joerg/
Uppsala University tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094
