1) films are often dubbed, so they exist in parallel languages
2) in the age of Web 2.0, people write transcripts and upload them
3) time stamps support alignment
see http://urd.let.rug.nl/tiedeman/OPUS/ - there's lots of (quasi-spoken) data for lots of language-pairs
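To illustrate point 3, here is a minimal sketch of pairing subtitle lines by time overlap. It is not the method OPUS uses; the cue format (start, end, text), the function names and the 0.5 overlap threshold are all assumptions made for the example, and real subtitle files would first need parsing and probably some correction for timing drift.

```python
from typing import List, Tuple

# Hypothetical cue format: (start_seconds, end_seconds, text)
Cue = Tuple[float, float, str]


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time span shared by two cues (0 if none)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align_by_time(src: List[Cue], tgt: List[Cue],
                  min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each source cue with the target cue it overlaps most in time.

    A pair is kept only if the shared span covers at least `min_ratio`
    of the shorter cue; the threshold is an arbitrary illustrative value.
    """
    pairs = []
    for s in src:
        best = max(tgt, key=lambda t: overlap(s, t), default=None)
        if best is None:
            continue
        shorter = min(s[1] - s[0], best[1] - best[0])
        if shorter > 0 and overlap(s, best) / shorter >= min_ratio:
            pairs.append((s[2], best[2]))
    return pairs


# Invented toy data, for illustration only
en = [(1.0, 3.0, "Hello."), (4.0, 6.5, "How are you?")]
de = [(1.1, 2.9, "Hallo."), (4.2, 6.4, "Wie geht's?"), (8.0, 9.0, "Gut.")]

for src_text, tgt_text in align_by_time(en, de):
    print(src_text, "|||", tgt_text)
```

The third German cue has no English counterpart in the same time window, so it is left unpaired rather than forced into an alignment.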
2008/2/27 Alexandre Rafalovitch <arafalov at gmail.com>:
> Official documents of the United Nations are translated (by human
> translators) into 6 languages (English, French, Spanish, Russian,
> Chinese, Arabic). Unfortunately, they are not available as
> research-ready bitexts, but the documents themselves are available from
> http://documents.un.org . There is quite a lot of text there, if one
> is ready to do some non-traditional parsing.
> For the last 8-10 years, most of those documents have been available
> in MSWord format. The rest are in PDFs (some with text and some with
> scanned images).
> LDC had a very old sample of UN documents; I think that was before
> MSWord versions started to be published, so they had to scan and clean
> their data.
> I have more information available, if somebody takes an interest. I am
> doing research in Named Entity Recognition in that domain, but there
> are enough challenges in the corpora to go around.
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
> On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> > Dear colleagues,
> > Is anyone aware of attempts to estimate how much machine-readable
> > parallel text is publicly available? I'm trying to get a general
> > sense of the scale of parallel data we currently have (and are likely
> > to have in the future, assuming current growth trends). Does anyone
> > have any statistics on this sort of thing?
--
================================================
Adam Kilgarriff                      http://www.kilgarriff.co.uk
Lexical Computing Ltd                http://www.sketchengine.co.uk
Lexicography MasterClass Ltd         http://www.lexmasterclass.com
Universities of Leeds and Sussex     adam at lexmasterclass.com
================================================