[Corpora-List] Most common non-Romance, non-Germanic words in English

Jim Fidelholtz fidelholtz at gmail.com
Thu Apr 10 03:22:31 CEST 2014

Hi, Tristan,

OK on the source language, although it would help if we knew what the purpose of your list would be. I didn't have any illusions that you wanted *all* such words. Still, you can see from the other comments that there is a large number of 'other' languages to pick your loanwords from, and I still haven't seen even an approximate *number* of such borrowings. Starting with the most frequent words, you would have to take a rather large list of words even to begin processing with the various methods you and others have suggested for winnowing down the list.

Unless, for example, you are interested in doing a similar study for borrowings into Spanish, I don't think you can really find *any* English borrowings from Nauatl that did not originally come via Spanish. Your criterion would therefore eliminate essentially every word from this language, a result which I would find unfortunate for any research I can imagine on borrowings into English from (various) languages, even after eliminating the ones you want to eliminate. Btw, another problem you will find in this regard is determining, for less frequent words (but still among the 'more frequent', depending on how you define this), exactly which language a word was taken from. Indeed, there are often clues in the phonological development of the word in English as to what language it must have come from (i.e., via) originally. But at least sometimes, for Nauatl, some speakers or varieties of English may have been in contact with speakers of Nauatl, and the word may have been borrowed independently from more than one language and/or at different times. Multilingualism is very complex, on the one hand, and etymologists are known to commit (and perpetrate) errors, on the other hand. This, of course, includes folk etymology, which is rampant in, e.g., place names, among other things.
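
To make the difference between the two criteria concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the etymology chains are a tiny hand-entered toy table (not taken from the OED or any other dictionary), the names `CHAINS`, `EXCLUDED` and `filter_words` are made up, and a real chain would often be disputed or branching, as noted above. It also applies the "drop words not in the dictionary first" step from the quoted message below.

```python
# Toy etymology chains: most recent source language first, ultimate source last.
# Hand-entered for illustration only; a real dictionary would be far messier.
CHAINS = {
    "house": ["Old English"],
    "nation": ["French", "Latin"],
    "chocolate": ["French", "Spanish", "Nahuatl"],
    "tomato": ["Spanish", "Nahuatl"],
    "taboo": ["Tongan"],
}

# Romance and Germanic languages that happen to occur in the toy table.
EXCLUDED = {"French", "Spanish", "Latin", "Old English"}


def any_origin_excluded(chain):
    """True if ANY step of the chain is Romance or Germanic."""
    return any(lang in EXCLUDED for lang in chain)


def ultimate_origin_excluded(chain):
    """True only if the LAST (oldest recorded) step is Romance or Germanic."""
    return chain[-1] in EXCLUDED


def filter_words(ranked_words, criterion):
    """Keep words that are in the dictionary and not excluded by the criterion."""
    return [w for w in ranked_words if w in CHAINS and not criterion(CHAINS[w])]


# A pretend frequency-ordered list; "blorp" stands for a form with no entry.
ranked = ["house", "nation", "chocolate", "tomato", "taboo", "blorp"]

strict = filter_words(ranked, any_origin_excluded)
loose = filter_words(ranked, ultimate_origin_excluded)

print("any-origin criterion:     ", strict)
print("ultimate-origin criterion:", loose)
```

Under the 'any origin' reading both Nauatl words drop out because of the Spanish step, whereas the 'ultimate origin' reading keeps them; which behaviour is wanted depends entirely on the purpose of the list.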

Also btw, the point of starting with an extremely large corpus (well over a billion words) would be to try to minimize the effect which tends to scramble words on the frequency list below, say, the first thousand, by a fairly large number of positions. [Note: even below about rank 100 you will find pretty large variation in the positions of words in the list each and every time you redo a count with new (even comparable) data.] This is why Carroll et al. gave so much weight to their measure of genre distribution. As a result, among the very *last* words listed by frequency (after all but a few tens of the several tens of thousands of real hapax words) are a few words which occur 2 or more times in the whole corpus, but only in the genre [religion] with the fewest selections taken in forming the corpus.
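
The scrambling effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: word frequencies are taken to follow an idealized Zipf distribution (probability proportional to 1/rank), the two 'corpora' are independent random samples from it, and the vocabulary size, sample size, seed and band boundaries are arbitrary choices of mine. Real corpora also differ in genre and topic, which this ignores entirely.

```python
import random
from collections import Counter


def zipf_sample(vocab_size, n_tokens, rng):
    """Draw n_tokens word ids (1..vocab_size), P(id) proportional to 1/id."""
    ids = list(range(1, vocab_size + 1))
    weights = [1.0 / i for i in ids]
    return rng.choices(ids, weights=weights, k=n_tokens)


def rank_table(tokens, vocab_size):
    """Map every word id to its frequency rank in this sample (1 = most frequent).

    Unseen words count as frequency 0; ties are broken by word id.
    """
    counts = Counter(tokens)
    ordered = sorted(range(1, vocab_size + 1), key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def mean_shift(ranks_a, ranks_b, lo, hi):
    """Mean absolute rank difference for words whose true rank is in lo..hi."""
    words = range(lo, hi + 1)
    return sum(abs(ranks_a[w] - ranks_b[w]) for w in words) / len(words)


rng = random.Random(2014)  # fixed seed so the run is repeatable
VOCAB, TOKENS = 5000, 100_000

ranks_a = rank_table(zipf_sample(VOCAB, TOKENS, rng), VOCAB)
ranks_b = rank_table(zipf_sample(VOCAB, TOKENS, rng), VOCAB)

BANDS = [(1, 10), (11, 100), (101, 1000), (1001, 5000)]
shifts = [mean_shift(ranks_a, ranks_b, lo, hi) for lo, hi in BANDS]

for (lo, hi), s in zip(BANDS, shifts):
    print(f"true ranks {lo}-{hi}: mean rank shift {s:.1f}")
```

Raising `TOKENS` while keeping `VOCAB` fixed should shrink the shifts in every band, which is the argument for starting from as large a corpus as possible.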

I'd be interested to hear more about your project, in any case.


James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Wed, Apr 9, 2014 at 11:30 AM, Tristan Miller <miller at ukp.informatik.tu-darmstadt.de> wrote:

> Dear Jim,
> Thanks for your insightful remarks. To address a few matters:
> On 08/04/14 11:09 PM, Jim Fidelholtz wrote:
> > Nevertheless, you
> > specify that they do not have 'AN origin in any Romance or Germanic
> > language' (my emphasis), which is vague, to say the least. An example:
> > 'chocolate', which according to me surely would find its way into your
> > list, or should, comes from Nauatl, according to the online etymological
> > dictionary via Spanish and other European languages. In English, by
> > default we assume any borrowed word (especially food-related ones) would
> > come from French, which I believe is true in this case. So 'chocolate'
> > clearly has 'an' origin in French (Romance), although its ultimate
> > origin is Nauatl and it should therefore be in your list.
> I don't think this part of my phrasing was vague, as you seem to have
> interpreted it correctly. Yes, I mean to exclude words like "chocolate"
> which arrived in English via French. If I had wanted them in my list, I
> might have written something like "words whose earliest post-PIE origin
> cannot be traced to a Germanic or Romance language".
> > Likewise, you specify that you wish to find 'the most frequent [such]
> > words in English' which is also worse than vague, since theoretically
> > (George Bedell, MIT PhD thesis ca. 1969:
> > nationalizationalizationalize... -- and apparently practically as well:
> > see Baayen 2001) there are an infinite number of words in English; thus,
> > unless you specify a specific number of the most frequent non-Germanic
> > non-Romance English words, there will be an infinite number of them as
> > well (if you think I'm wrong in my count, just wait a few millennia!).
> Well, I thought it would have gone without saying that I didn't want
> *all* such words -- after all, I made reference in my message to using
> existing corpora, which must be of finite size. :)
> > If you are going to put an upper limit (say N) on the number of such
> > words, as a practical matter your quest should not be so difficult. Find
> > any huge list of English words (alternative: take the largest English
> > corpus, eg, combine all of the English corpora [COCA, etc.] on the BYU
> > site of Mark Davies, make up a list in frequency order, most frequent
> > first, and then eliminate all the obviously Romance words [any word
> > ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones
> > in a similar way. Using prefixes from medical dictionaries, eliminate
> > all the words using them (almost all are from Greek or Latin; Latin is
> > Romance; the Greek ones almost all entered via Latin (medieval
> > university Latin or 'modern' Latin)). This will at least shorten your
> > list a great deal. You should get an original corpus from Davies of
> > somewhere between 5 and 10 billion [American sense] words. This might
> > give you as much as 175,000,000 distinct word forms, based on figures
> > from a 5 million-word corpus (The American Heritage corpus, 1971),
> > before your winnowing.
> I think the more important limiting factor for the list is not the
> number of words in the corpus, but rather the number of words in the
> etymological MRD. That is, assuming the frequency counts are already
> available, there's no need to heuristically exclude Romance and Germanic
> words (and indeed, I don't think I'd want to, as in my experience you
> get too many false positives). In the first instance we can simply
> filter out all words which don't appear in the dictionary, and then look
> up the remainder. In the worst case this will involve looking up every
> word in the dictionary once, which, if done automatically, can't take
> more than a few hours or days of computing time. The problem is finding
> such a dictionary and an API therefor. I've got an offline copy of the
> OED2, though I don't know if it's possible to query via API, or how easy
> it would be to parse the etymological information.
> Regards,
> Tristan
> --
> Tristan Miller, Research Scientist
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/