[Corpora-List] Seeking bilingual corpora, colloquial register, gaming

Alex Juan alhelsal at posgrado.upv.es
Mon Aug 6 11:58:39 CEST 2012


Dear all,

I am looking for bilingual/multilingual corpora that could be classified as UGC, that is, user-generated content. Examples include (but are not limited to) chat conversations, support forum threads, and phone/SMS/email transcripts.

As you know, the language here is not always "standard": this content may not only be rich in abbreviations but also contain spelling mistakes, figurative language, and swearwords. If there are also collections or repositories of keywords (aka "seed" words) used in similar studies, those would also be of help. In the first instance, the languages of interest are German and English, with the items of the corpora or repositories aligned with one another.

I am attempting to build a DE<>EN MT prototype for the gaming domain.

Does anyone know of such a corpus? Any information or pointers will be appreciated (even if they come from specialists in other HLT fields, such as sentiment analysis or the semantic web).

Thanks.

-- Alex Juan


