For the last 8-10 years, most of those documents have been available in MS Word format. The rest are PDFs (some with embedded text, some containing only scanned images).
LDC had a very old sample of UN documents; I think it dates from before MS Word versions started to be published, so they had to scan and clean the data.
I have more information available if anybody is interested. I am doing research on Named Entity Recognition in that domain, but there are enough challenges in the corpora to go around.
--
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> Dear colleagues,
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available? I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends). Does anyone
> have any statistics on this sort of thing?