[Corpora-List] quantities of publicly available parallel text?

Mike Maxwell maxwell at umiacs.umd.edu
Wed Feb 27 05:07:20 CET 2008


Chris Dyer wrote:
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available? I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends). Does anyone
> have any statistics on this sort of thing?

I've tried to come up with these figures several times, with emphasis on languages other than the "high density languages". One set of figures (from my talk at the ACL 2005 Workshop on Building and Using Parallel Texts) of what you could expect to find in the way of bitext, at a minimum, is the following:

------------
If the language is written at all: the New Testament (140k words (tokens) in Greek)
For languages which have the complete Bible (OT and NT): ~770k words (tokens; ~30k types in English)
Other common sources: the Universal Declaration of Human Rights (~1800 words in English)
------------
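(Figures like those are just raw token and type counts. If anyone wants to reproduce that sort of estimate for a text of their own, a few lines of Python along these lines would do it; the file name below is just a placeholder, and a regex-based tokenizer is obviously crude for many scripts and orthographies.)

import re
from collections import Counter

# Rough token/type counts for a plain-text file; "nt.txt" is a placeholder name.
with open("nt.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = re.findall(r"\w+", text)   # pull out runs of word characters
types = Counter(tokens)             # distinct word forms and their frequencies

print(len(tokens), "tokens,", len(types), "types")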

There are no solid figures on how many languages are written, but I'd guess in the neighborhood of 1500 (out of the roughly 7000 languages in the world). Of course, not all of those texts are in electronic form, but it wouldn't take a large effort to key them in. (I would guess that OCR is probably not reliable enough.)

While we were at the LDC, around 2004, Bill Poser and I did a survey of resources for low-density languages (LoDLs), specifically for all languages with at least a million speakers (according to the Ethnologue), but leaving out most of the European languages, as well as Japanese, Mandarin Chinese, Modern Standard Arabic, and Korean. Our goal was not to find out how much of each resource was available, but to see which languages had at least a certain minimum level of resources. For bilingual text, the minimum level was 100k words in electronic form, either in a corpus or estimated to be available if you scrounged around on the internet. Of the 300 or so languages with at least a million speakers, we got through around 150 before we ran out of time. I don't have the figures right now (and more importantly, they're badly out of date), but I think we came up with fewer than 30 languages (maybe far fewer) that had that amount of parallel text. Of course that leaves out the high density languages, so you could add another 20 or 30, and I suspect the number is substantially higher now. There are some surprises: Basque and Inuit, for example, have substantial amounts of parallel text.

I suspect translation houses own a fair amount of bitext, but for various reasons can't release it. (I don't know what genre it is.)

Later, when we worked on a project to create a set of resources for LoDLs at the LDC, the scarcity of bilingual text "in the wild" made us decide to create our bitext by contracting out to translation agencies for most of the languages. The languages in question were Hungarian (for which substantial bitext already existed), Uzbek, Bengali (= Bangla), Urdu, Tigrinya, Yoruba (which hardly had any electronic text, much less bitext in electronic form), and Tagalog (the Communist Party of the Philippines had thoughtfully provided bitext for this and a couple other Philippine languages, although I've heard that the translations were a bit stilted). Some other languages were added later, and NMSU did several too.

As for the trends, I think the short answer is "translation is expensive; who will pay for it?" With regard to that, Mark Davis has an interesting graph of GDP by language in Unicode Technical Note #13 (http://www.unicode.org/notes/tn13). It's not very encouraging. And while there has been a noticeable increase in computational resources for many languages in the years since Bill and I looked at this, the bar has also gotten a lot higher: 100k words is probably far too low as a threshold now, for example.

--

Mike Maxwell

What good is a universe without somebody around to look at it?

--Robert Dicke, Princeton physicist


