[Corpora-List] (no subject)

maxwell maxwell at umiacs.umd.edu
Wed Oct 19 21:26:23 CEST 2011


On Wed, 19 Oct 2011 17:56:27 +0100 (BST), Abu Fahad <salehosaimi at yahoo.com> wrote:
> Are you aware of any frequency list of Arabic?

Let me up the generality a bit, since "frequency lists of language X" seems to be a common request.

I would be surprised if there weren't frequency lists for many languages. Obviously there are questions (stemmed or not? what does it mean to have a "balanced" corpus from which to derive such lists?), but such lists probably have at least some utility. Is there a place with links to frequency lists of multiple languages?

There are lists of "correctly" spelled words for some languages, which are sometimes grouped into top 10k words, top 20k etc. I suppose one could derive a very coarse-grained ranking from such lists. Obviously these would be inflected words, not stemmed words.
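
To make the coarse-grained ranking idea concrete, here is a minimal sketch, assuming the spelling lists come as nested "top N" tiers (top 10k contained in top 20k, and so on). The function name coarse_ranks and the toy data are made up for illustration; real lists would be read from files. Each word simply gets the smallest tier it appears in, so the result is a rank band rather than a true frequency.

```python
def coarse_ranks(tiers):
    """Map each word to the smallest "top N" cutoff whose list contains it.

    tiers: dict mapping a cutoff (e.g. 10000) to an iterable of the
    inflected word forms in that "top N" list.
    """
    ranks = {}
    # Visit tiers from the smallest cutoff upward so the first hit wins.
    for cutoff in sorted(tiers):
        for word in tiers[cutoff]:
            ranks.setdefault(word, cutoff)
    return ranks


# Toy data, invented for illustration only (not from any real list).
toy_tiers = {
    20000: ["the", "of", "corpus", "lemma"],
    10000: ["the", "of"],
}
ranks = coarse_ranks(toy_tiers)
```

Words within the same band stay unordered relative to one another, which is about as much as such lists can tell you.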

I looked at the ACL wiki (http://aclweb.org/aclwiki/index.php?title=Main_Page), but nothing jumped out at me. So I'll prime the pump with a few links to such lists I did find:

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists

http://invokeit.wordpress.com/frequency-word-lists/

(there's a link to a link to an Arabic list here)

http://borel.slu.edu/crubadan/index.html

("...send me an email if you're interested in a particular language and there's plenty of data I am free to share (frequency lists...")

Mike Maxwell


