[Corpora-List] Most common non-Romance, non-Germanic words in English

Jim Fidelholtz fidelholtz at gmail.com
Tue Apr 8 23:09:33 CEST 2014


Hi, Tristan,

A couple of suggestions and comments. First of all, the figures I have seen, though not totally consistent, suggest that slightly over 60% (the mode seems to be about 63 percent, which I usually round up to 'about two-thirds') of English words (not counting Proper Nouns) are of Romance origin. One assumes that the bulk of the remainder (about 37%) would then be Germanic in origin. I can't remember ever seeing a discussion of this part of English vocabulary, but we know that very many words come from non-Romance languages, so I would guess that maybe less than 10% of English words fit your criteria, at first blush. Nevertheless, you specify that they do not have 'AN origin in any Romance or Germanic language' (my emphasis), which is vague, to say the least. An example: 'chocolate', which in my view surely would find its way into your list, or should, comes, according to the online etymological dictionary, from Nahuatl via Spanish and other European languages. In English, by default we assume any borrowed word (especially a food-related one) would come from French, which I believe is true in this case. So 'chocolate' clearly has 'an' origin in French (Romance), although its ultimate origin is Nahuatl and it should therefore be in your list.
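The two readings of 'an origin' can be made concrete with a small sketch. Everything here is an illustrative assumption rather than real dictionary data: the language set, the `CHAINS` table, and both function names are hypothetical, and the chain for 'chocolate' simply follows the account given above (French, then Spanish, then Nahuatl). Latin is counted as Romance for the purposes of this thread.

```python
# Hypothetical set of Romance and Germanic donor languages (Latin counted as Romance).
ROMANCE_OR_GERMANIC = {
    "French", "Spanish", "Italian", "Latin",
    "Old English", "German", "Dutch", "Old Norse",
}

# Illustrative etymology chains: immediate donor first, ultimate source last.
# These are hand-written examples, not entries from any real dictionary.
CHAINS = {
    "chocolate": ["French", "Spanish", "Nahuatl"],
    "table": ["French", "Latin"],
    "tea": ["Dutch", "Min Chinese"],
}


def passes_strict(word: str) -> bool:
    """Strict reading: keep a word only if NO stage is Romance or Germanic."""
    return not any(lang in ROMANCE_OR_GERMANIC for lang in CHAINS[word])


def passes_ultimate(word: str) -> bool:
    """Ultimate-origin reading: keep a word if its LAST stage is neither."""
    return CHAINS[word][-1] not in ROMANCE_OR_GERMANIC


for w in CHAINS:
    print(w, passes_strict(w), passes_ultimate(w))
```

Under the strict reading 'chocolate' is thrown out because of its French and Spanish stages; under the ultimate-origin reading it stays. Which reading is intended decides what ends up in the list.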

Likewise, you specify that you wish to find 'the most frequent [such] words in English' which is also worse than vague, since theoretically (George Bedell, MIT PhD thesis ca. 1969: nationalizationalizationalize... -- and apparently practically as well: see Baayen 2001) there are an infinite number of words in English; thus, unless you specify a specific number of the most frequent non-Germanic non-Romance English words, there will be an infinite number of them as well (if you think I'm wrong in my count, just wait a few millennia!).

The main point is, you need to specify your parameters more clearly (even then, you will surely have a number of unclear or indeterminate cases).

If you are going to put an upper limit (say N) on the number of such words, as a practical matter your quest should not be so difficult. Find any huge list of English words (alternative: take the largest English corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU site of Mark Davies), make up a list in frequency order, most frequent first, and then eliminate all the obviously Romance words (any word ending in -tion, -nce, etc.); then eliminate the obviously Germanic ones in a similar way. Using prefixes from medical dictionaries, eliminate all the words using them (almost all are from Greek or Latin; Latin is Romance; the Greek ones almost all entered via Latin, whether medieval university Latin or 'modern' Latin). This will at least shorten your list a great deal. You should get an original corpus from Davies of somewhere between 5 and 10 billion [American sense] words. This might give you as many as 175,000,000 distinct word forms, based on figures from a 5 million-word corpus (The American Heritage corpus, 1971), before your winnowing. Of course, the figure will be much less, since I haven't taken into consideration the geometric decrease in the number of different word forms in larger corpora. In any case, you can see that you will have some work left to do. I don't want to minimize how much work you would have to do, but I think you have already thought up a number of ways to cut down on it and others will surely occur to you, even if you don't find just the kind of corpora to help you that you are looking for. In any case, good luck.
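As a rough illustration of the winnowing described above, here is a minimal Python sketch. Only the suffixes -tion and -nce come from this message; the prefix sample, the Germanic stoplist, and the function names are hypothetical placeholders that would need to be replaced with real resources (a medical dictionary's prefix list, a proper Germanic word list). The heuristics are crude and will both over- and under-filter.

```python
import re
from collections import Counter

# Suffixes named in the message as 'obviously Romance'.
ROMANCE_SUFFIXES = ("tion", "nce")
# Hypothetical sample of classical prefixes; a real list would come
# from a medical dictionary.
CLASSICAL_PREFIXES = ("cardio", "hyper", "neuro")
# Hypothetical stand-in for an 'obviously Germanic' word list.
GERMANIC_STOPLIST = {"the", "and", "of", "in", "to", "is"}


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()


def candidates(freq: list[tuple[str, int]], n: int) -> list[str]:
    """Keep the first n words that survive all three heuristic filters."""
    kept = []
    for word, _count in freq:
        if word.endswith(ROMANCE_SUFFIXES):
            continue  # looks Romance by suffix
        if word.startswith(CLASSICAL_PREFIXES):
            continue  # looks Greek/Latin by prefix
        if word in GERMANIC_STOPLIST:
            continue  # listed as Germanic
        kept.append(word)
        if len(kept) == n:
            break
    return kept


sample = "the tea and the chocolate and the nation and the distance the tea"
print(candidates(frequency_list(sample), 10))
```

Whatever survives is only a shortlist of candidates; each remaining word would still have to be checked by hand or against an etymological dictionary.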

Jim

James L. Fidelholtz Posgrado en Ciencias del Lenguaje Instituto de Ciencias Sociales y Humanidades Benemérita Universidad Autónoma de Puebla, MÉXICO

On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller < miller at ukp.informatik.tu-darmstadt.de> wrote:


> Dear all,
>
> I'm interested in finding the most frequent words in English which do
> not have an origin in any Romance or Germanic language. Does anyone
> know if such a list is available anywhere?
>
> If not, I suppose I could produce one myself easily enough by taking a
> raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
> querying each entry in a machine-readable dictionary which provides
> etymological information, and filtering appropriately. But that
> presupposes that such a dictionary exists. Does anyone know of a
> suitable freely available dictionary for this task? Since I'd need to
> automatically query many thousands of words, I'd want something that I
> can download for offline use and access through an API. I could try
> accessing an offline dump of Wiktionary using the JWKTL API, though I
> suspect Wiktionary's etymological coverage is too spotty.
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Research Scientist
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/