[Corpora-List] quantities of publicly available parallel text?

Alexandre Rafalovitch arafalov at gmail.com
Wed Feb 27 15:47:30 CET 2008


Official documents of the United Nations are translated (by human translators) into 6 languages (English, French, Spanish, Russian, Chinese, Arabic). Unfortunately, they are not available as research-ready bitexts, but the documents themselves are available from http://documents.un.org . There is quite a lot of text there, if one is ready to do some non-traditional parsing.

For the last 8-10 years, most of those documents have been available in MSWord format. The rest are PDFs (some containing text and some containing only scanned images).

LDC had a very old sample of UN documents; I think that was before MSWord versions started to be published, so they had to scan and clean their data.

I can share more information if anybody is interested. I am doing research on Named Entity Recognition in that domain, but there are enough challenges in the corpora to go around.

Regards,

Alex.

--
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/

On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> Dear colleagues,
>
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available? I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends). Does anyone
> have any statistics on this sort of thing?


