[Corpora-List] Most common non-Romance, non-Germanic words in English

Jim Fidelholtz fidelholtz at gmail.com
Thu Apr 10 02:24:11 CEST 2014

Hi, Erin,

Sorry about the brevity. I was referring to Harald Baayen's book _Word frequency distributions_ (2001), published by Kluwer in Dordrecht, Netherlands. I am copying this to the list in case others were similarly mystified by the reference.


PS: btw, the book was quite understandable (esp. if you have a mathematical background) and, I found, also quite provocative. (I made earlier comments on this list [within a few years of its being published].)

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <erin at wordnik.com> wrote:

> Dear Jim,
> I found this email fascinating, but I wonder if you might have a more
> explicit citation for Baayen 2001? Did you mean
> Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in morphology:
> modeling the choice of linking morphemes in Dutch, Linguistics 39, 51-93.
> It's the only paper of that date on his publications page:
> http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html
> Any direction gratefully received!
> Yours,
> Erin
> On 4/8/14 2:09 PM, Jim Fidelholtz wrote:
>> Hi, Tristan,
>> A couple of suggestions and comments. First of all, the figures I have
>> seen, though not totally consistent, suggest that slightly over 60% (the
>> mode seems to be about 63 percent, which I usually round up to 'about
>> two-thirds') of English words (not counting Proper Nouns) are of Romance
>> origin. One assumes that the bulk of the remainder (about 37%) would
>> then be Germanic in origin. I can't remember ever seeing a discussion of
>> this part of English vocabulary, but we know that very many words come
>> from non-Romance languages, so I would guess that maybe less than 10% of
>> English words fit your criteria, at first blush. Nevertheless, you
>> specify that they do not have 'AN origin in any Romance or Germanic
>> language' (my emphasis), which is vague, to say the least. An example:
>> 'chocolate', which according to me surely would find its way into your
>> list, or should, comes from Nauatl, according to the online etymological
>> dictionary via Spanish and other European languages. In English, by
>> default we assume any borrowed word (especially food-related ones) would
>> come from French, which I believe is true in this case. So 'chocolate'
>> clearly has 'an' origin in French (Romance), although its ultimate
>> origin is Nauatl and it should therefore be in your list.
>> Likewise, you specify that you wish to find 'the most frequent [such]
>> words in English' which is also worse than vague, since theoretically
>> (George Bedell, MIT PhD thesis ca. 1969:
>> nationalizationalizationalize... -- and apparently practically as well:
>> see Baayen 2001) there are an infinite number of words in English; thus,
>> unless you specify a specific number of the most frequent non-Germanic
>> non-Romance English words, there will be an infinite number of them as
>> well (if you think I'm wrong in my count, just wait a few millennia!).
>> The main point is, you need to specify your parameters more clearly
>> (even then, you will surely have a number of unclear or indeterminate
>> cases).
>> If you are going to put an upper limit (say N) on the number of such
>> words, as a practical matter your quest should not be so difficult. Find
>> any huge list of English words (alternative: take the largest English
>> corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
>> site of Mark Davies), make up a list in frequency order, most frequent
>> first, and then eliminate all the obviously Romance words (any word
>> ending in -tion, -nce, etc.); then eliminate the obviously Germanic ones
>> in a similar way. Using prefixes from medical dictionaries, eliminate
>> all the words using them (almost all are from Greek or Latin; Latin is
>> Romance; the Greek ones almost all entered via Latin [medieval
>> university Latin or 'modern' Latin]). This will at least shorten your
>> list a great deal. You should get an original corpus from Davies of
>> somewhere between 5 and 10 billion [American sense] words. This might
>> give you as much as 175,000,000 distinct word forms, based on figures
>> from a 5 million-word corpus (The American Heritage corpus, 1971),
>> before your winnowing. Of course, the figure will be much less, since I
>> haven't taken into consideration the geometric decrease in the number of
>> different word forms in larger corpora. In any case, you can see that
>> you will have some work left to do. I don't want to minimize how much
>> work you would have to do, but I think you have already thought up a
>> number of ways to cut down on it and others will surely occur to you,
>> even if you don't find just the kind of corpora to help you that you are
>> looking for. In any case, good luck.
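
The 175,000,000 estimate in the quoted paragraph is a straight linear scaling from the 5-million-word corpus, and the same paragraph notes that the real figure will be much lower. A minimal sketch of that arithmetic follows. It assumes about 87,500 distinct word forms in the 5-million-word corpus (the value implied by 175,000,000 / 2,000, not a figure quoted in the message) and swaps in a Heaps-type power law, V = K * N**beta, for the slowdown in new word forms. The exponent values are purely illustrative assumptions, and the function names are hypothetical.

```python
def linear_types(base_types, base_tokens, target_tokens):
    """Scale the number of distinct word forms in direct proportion to corpus size."""
    return base_types * target_tokens / base_tokens


def heaps_types(base_types, base_tokens, target_tokens, beta):
    """Scale distinct word forms with a power law V = K * N**beta.

    K is fixed implicitly by requiring the curve to pass through the
    known point (base_tokens, base_types). beta is an assumed exponent
    between 0 and 1; beta = 1 reduces to linear scaling.
    """
    return base_types * (target_tokens / base_tokens) ** beta


# Assumed inputs: 87,500 is back-computed from the message's own numbers.
BASE_TYPES = 87_500
BASE_TOKENS = 5_000_000
TARGET_TOKENS = 10_000_000_000  # upper end of the 5 to 10 billion range

if __name__ == "__main__":
    print("linear:", linear_types(BASE_TYPES, BASE_TOKENS, TARGET_TOKENS))
    for beta in (0.5, 0.6, 0.7):  # illustrative exponents only
        estimate = heaps_types(BASE_TYPES, BASE_TOKENS, TARGET_TOKENS, beta)
        print(f"beta={beta}: {estimate:,.0f}")
```

Any exponent below 1 gives an estimate well under the linear one, which is the direction of the correction described in the message. The actual exponent would have to be fitted to the corpus in question.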
>> Jim
>> James L. Fidelholtz
>> Posgrado en Ciencias del Lenguaje
>> Instituto de Ciencias Sociales y Humanidades
>> Benemérita Universidad Autónoma de Puebla, MÉXICO
>> On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller
>> <miller at ukp.informatik.tu-darmstadt.de> wrote:
>> Dear all,
>> I'm interested in finding the most frequent words in English which do
>> not have an origin in any Romance or Germanic language. Does anyone
>> know if such a list is available anywhere?
>> If not, I suppose I could produce one myself easily enough by taking a
>> raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
>> querying each entry in a machine-readable dictionary which provides
>> etymological information, and filtering appropriately. But that
>> presupposes that such a dictionary exists. Does anyone know of a
>> suitable freely available dictionary for this task? Since I'd need to
>> automatically query many thousands of words, I'd want something that I
>> can download for offline use and access through an API. I could try
>> accessing an offline dump of Wiktionary using the JWKTL API, though I
>> suspect Wiktionary's etymological coverage is too spotty.
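
The pipeline described in the quoted paragraph (frequency list, etymology lookup, filter) can be sketched as below. This is a minimal sketch under stated assumptions, not a reference to any existing dictionary API: the `chains` mapping is a hypothetical stand-in for whatever machine-readable etymological dictionary is used, each chain lists language families from the immediate source back to the ultimate one, and the frequency counts are invented placeholders. The `mode` parameter is also an assumption, added to make the two readings of 'an origin' from this thread explicit.

```python
EXCLUDED = {"Romance", "Germanic"}


def is_candidate(chain, mode):
    """Decide whether a word passes the filter.

    chain: language families, immediate source first, ultimate source last.
    mode "any":      reject if ANY stage of the chain is Romance or Germanic.
    mode "ultimate": reject only if the ULTIMATE source is Romance or Germanic.
    """
    if mode == "any":
        return not any(family in EXCLUDED for family in chain)
    if mode == "ultimate":
        return chain[-1] not in EXCLUDED
    raise ValueError(f"unknown mode: {mode!r}")


def top_candidates(freqs, chains, n, mode="ultimate"):
    """Return up to n words, most frequent first, that pass the filter."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    selected = []
    for word, _count in ranked:
        if len(selected) >= n:
            break
        chain = chains.get(word)
        if not chain:
            continue  # no etymology recorded: leave for manual review
        if is_candidate(chain, mode):
            selected.append(word)
    return selected


# Toy data: counts are invented placeholders, not corpus figures.
SAMPLE_FREQS = {"the": 1000, "nation": 50, "tea": 30, "chocolate": 20}
SAMPLE_CHAINS = {
    "the": ["Germanic"],
    "nation": ["Romance"],
    "tea": ["Germanic", "Sinitic"],           # via Dutch, ultimately Chinese
    "chocolate": ["Romance", "Uto-Aztecan"],  # via Spanish/French, as in the thread
}

if __name__ == "__main__":
    print(top_candidates(SAMPLE_FREQS, SAMPLE_CHAINS, n=10, mode="ultimate"))
    print(top_candidates(SAMPLE_FREQS, SAMPLE_CHAINS, n=10, mode="any"))
```

Under the "ultimate" reading 'chocolate' is kept; under the "any" reading it is dropped because it passed through a Romance language. Words with no recorded etymology are skipped rather than guessed at, since the coverage of any real dictionary is likely to be incomplete.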
>> Regards,
>> Tristan
>> --
>> Tristan Miller, Research Scientist
>> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
>> Department of Computer Science, Technische Universität Darmstadt
>> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora