[Corpora-List] Most common non-Romance, non-Germanic words in English

Tristan Miller miller at ukp.informatik.tu-darmstadt.de
Wed Apr 9 18:30:07 CEST 2014

Dear Jim,

Thanks for your insightful remarks. To address a few matters:

On 08/04/14 11:09 PM, Jim Fidelholtz wrote:
> Nevertheless, you
> specify that they do not have 'AN origin in any Romance or Germanic
> language' (my emphasis), which is vague, to say the least. An example:
> 'chocolate', which according to me surely would find its way into your
> list, or should, comes from Nauatl, according to the online etymological
> dictionary via Spanish and other European languages. In English, by
> default we assume any borrowed word (especially food-related ones) would
> come from French, which I believe is true in this case. So 'chocolate'
> clearly has 'an' origin in French (Romance), although its ultimate
> origin is Nauatl and it should therefore be in your list.

I don't think this part of my phrasing was vague, as you seem to have interpreted it correctly. Yes, I mean to exclude words like "chocolate", which arrived in English via French. If I had wanted them in my list, I might have written something like "words whose earliest post-PIE origin cannot be traced to a Germanic or Romance language".

> Likewise, you specify that you wish to find 'the most frequent [such]
> words in English' which is also worse than vague, since theoretically
> (George Bedell, MIT PhD thesis ca. 1969:
> nationalizationalizationalize... -- and apparently practically as well:
> see Baayen 2001) there are an infinite number of words in English; thus,
> unless you specify a specific number of the most frequent non-Germanic
> non-Romance English words, there will be an infinite number of them as
> well (if you think I'm wrong in my count, just wait a few millennia!).

Well, I thought it would have gone without saying that I didn't want *all* such words -- after all, I made reference in my message to using existing corpora, which must be of finite size. :)

> If you are going to put an upper limit (say N) on the number of such
> words, as a practical matter your quest should not be so difficult. Find
> any huge list of English words (alternative: take the largest English
> corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
> site of Mark Davies, make up a list in frequency order, most frequent
> first, and then eliminate all the obviously Romance words [any word
> ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones
> in a similar way. Using prefixes from medical dictionaries, eliminate
> all the words using them (almost all are from Greek or Latin; Latin is
> Romance; the Greek ones almost all entered via Latin (medieval
> university Latin or 'modern' Latin). This will at least shorten your
> list a great deal. You should get an original corpus from Davies of
> somewhere between 5 and 10 billion [American sense] words. This might
> give you as much as 175,000,000 distinct word forms, based on figures
> from a 5 million-word corpus (The American Heritage corpus, 1971),
> before your winnowing.

I think the more important limiting factor for the list is not the number of words in the corpus, but rather the number of words in the etymological machine-readable dictionary (MRD). That is, assuming the frequency counts are already available, there's no need to heuristically exclude Romance and Germanic words (and indeed, I don't think I'd want to, as in my experience such heuristics produce too many false positives). In the first instance we can simply discard every word in the frequency list that doesn't appear in the dictionary, and then look up the etymology of the remainder. In the worst case this will involve looking up every word in the dictionary once, which, if done automatically, can't take more than a few hours or days of computing time. The problem is finding such a dictionary and an API for it. I've got an offline copy of the OED2, though I don't know whether it can be queried via an API, or how easy it would be to parse the etymological information.
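
To make the filter-then-look-up idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the etymology table is a hypothetical stand-in for whatever an etymological dictionary would return (it is not an OED2 interface), the language labels and the decision to count Latin as "Romance" are simplifications, and the frequency counts are invented toy values.

```python
from collections import Counter

# Hypothetical label set; a real dictionary would have its own scheme.
# Latin is lumped in with Romance here purely for simplicity.
ROMANCE_OR_GERMANIC = {
    "Latin", "French", "Spanish", "Italian", "Portuguese",
    "Old English", "Old Norse", "Dutch", "German",
}


def top_non_romance_germanic(freqs, etymologies, n=10):
    """Return the n most frequent words with no Romance/Germanic stage.

    freqs       -- Counter mapping word -> corpus frequency
    etymologies -- dict mapping word -> list of languages the word
                   passed through (hypothetical dictionary output)
    """
    results = []
    # most_common() yields (word, count) pairs in descending frequency.
    for word, count in freqs.most_common():
        chain = etymologies.get(word)
        if chain is None:
            continue  # word is not in the dictionary: discard it
        if any(lang in ROMANCE_OR_GERMANIC for lang in chain):
            continue  # has "an" origin in a Romance or Germanic language
        results.append((word, count))
        if len(results) == n:
            break
    return results


# Toy data only: the counts are invented and the chains are illustrative.
toy_freqs = Counter(
    {"the": 1000, "chocolate": 40, "kangaroo": 30, "taboo": 20, "blorp": 10}
)
toy_etymologies = {
    "the": ["Old English"],
    "chocolate": ["Nahuatl", "Spanish", "French"],
    "kangaroo": ["Guugu Yimithirr"],
    "taboo": ["Tongan"],
}

print(top_non_romance_germanic(toy_freqs, toy_etymologies, n=2))
```

Note that "chocolate" is dropped because its chain includes Spanish and French, which matches the exclusion I described above, and "blorp" is dropped because it has no dictionary entry. Each word is looked up at most once, so the cost is bounded by the size of the frequency list.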

Regards, Tristan

