[Corpora-List] Most common non-Romance, non-Germanic words in English

Erin McKean erin at logocracy.com
Thu Apr 10 16:09:22 CEST 2014


Dear Jim,

Thank you VERY much!

Yours,

Erin

On 4/9/14 5:24 PM, Jim Fidelholtz wrote:
> Hi, Erin,
>
> Sorry about the brevity. I was referring to Harald Baayen's book _Word
> frequency distributions_, published in 2001 by Kluwer (Dordrecht,
> Netherlands). I am copying this to the list, in case others were
> similarly mystified by the reference.
>
> Jim
>
> PS: btw, the book was quite understandable (esp. if you have a
> mathematical background) and, I found, also quite provocative. (I made
> earlier comments on it on this list, within a few years of its being
> published.)
>
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Benemérita Universidad Autónoma de Puebla, MÉXICO
>
>
> On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <erin at wordnik.com> wrote:
>
> Dear Jim,
>
> I found this email fascinating, but I wonder if you might have a
> more explicit citation for Baayen 2001? Did you mean
>
> Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in
> morphology: modeling the choice of linking morphemes in Dutch,
> Linguistics 39, 51-93.
>
> It's the only paper of that date on his publications page:
> http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html
>
> Any direction gratefully received!
>
> Yours,
>
> Erin
>
>
> On 4/8/14 2:09 PM, Jim Fidelholtz wrote:
>
> Hi, Tristan,
>
> A couple of suggestions and comments. First of all, the figures I have
> seen, though not totally consistent, suggest that slightly over 60% (the
> mode seems to be about 63 percent, which I usually round up to 'about
> two-thirds') of English words (not counting proper nouns) are of Romance
> origin. One assumes that the bulk of the remainder (about 37%) would
> then be Germanic in origin. I can't remember ever seeing a discussion of
> this part of English vocabulary, but we know that very many words come
> from non-Romance languages, so I would guess that maybe fewer than 10% of
> English words fit your criteria, at first blush. Nevertheless, you
> specify that they do not have 'AN origin in any Romance or Germanic
> language' (my emphasis), which is vague, to say the least. An example:
> 'chocolate', which in my view surely would (or should) find its way into
> your list, comes from Nahuatl via Spanish and other European languages,
> according to the online etymological dictionary. In English, by default
> we assume any borrowed word (especially a food-related one) would come
> from French, which I believe is true in this case. So 'chocolate'
> clearly has 'an' origin in French (Romance), although its ultimate
> origin is Nahuatl and it should therefore be in your list.
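
The two readings of 'an origin' can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Stage` class, the family labels and both function names are hypothetical, and the borrowing chain for 'chocolate' is simply the one described in the message above, not one checked against a dictionary.

```python
from dataclasses import dataclass

# Families to exclude, per the original request.
EXCLUDED = frozenset({"Romance", "Germanic"})


@dataclass(frozen=True)
class Stage:
    """One step in a borrowing chain (hypothetical structure)."""
    language: str
    family: str


# Chain as described in the message: nearest donor first, ultimate source last.
CHOCOLATE = (
    Stage("French", "Romance"),
    Stage("Spanish", "Romance"),
    Stage("Nahuatl", "Uto-Aztecan"),
)


def has_excluded_stage(chain):
    """Strict reading: the word has *an* origin in an excluded family."""
    return any(stage.family in EXCLUDED for stage in chain)


def ultimate_origin_excluded(chain):
    """Lenient reading: only the ultimate source language counts."""
    return chain[-1].family in EXCLUDED


print(has_excluded_stage(CHOCOLATE))        # True: strict reading drops the word
print(ultimate_origin_excluded(CHOCOLATE))  # False: lenient reading keeps it
```

Under the strict reading 'chocolate' is filtered out; under the ultimate-origin reading it stays, which is why the criterion needs to be stated explicitly.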
>
> Likewise, you specify that you wish to find 'the most frequent [such]
> words in English' which is also worse than vague, since theoretically
> (George Bedell, MIT PhD thesis ca. 1969:
> nationalizationalizationalize... -- and apparently practically as well:
> see Baayen 2001) there are an infinite number of words in English; thus,
> unless you specify a specific number of the most frequent non-Germanic
> non-Romance English words, there will be an infinite number of them as
> well (if you think I'm wrong in my count, just wait a few millennia!).
>
> The main point is, you need to specify your parameters more clearly
> (even then, you will surely have a number of unclear or indeterminate
> cases).
>
> If you are going to put an upper limit (say N) on the number of such
> words, as a practical matter your quest should not be so difficult. Find
> any huge list of English words (alternative: take the largest English
> corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
> site of Mark Davies, and make up a list in frequency order, most
> frequent first). Then eliminate all the obviously Romance words [any
> word ending in -tion, -nce, etc.]; then eliminate the obviously Germanic
> ones in a similar way. Using prefixes from medical dictionaries,
> eliminate all the words using them (almost all are from Greek or Latin;
> Latin is Romance; the Greek ones almost all entered via Latin [medieval
> university Latin or 'modern' Latin]). This will at least shorten your
> list a great deal. You should get an original corpus from Davies of
> somewhere between 5 and 10 billion [American sense] words. This might
> give you as many as 175,000,000 distinct word forms, based on figures
> from a 5 million-word corpus (The American Heritage corpus, 1971),
> before your winnowing. Of course, the figure will be much less, since I
> haven't taken into consideration the geometric decrease in the number of
> different word forms in larger corpora. In any case, you can see that
> you will have some work left to do. I don't want to minimize how much
> work you would have to do, but I think you have already thought up a
> number of ways to cut down on it and others will surely occur to you,
> even if you don't find just the kind of corpora to help you that you are
> looking for. In any case, good luck.
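
A rough sketch of the winnowing step and of the vocabulary-size arithmetic above may help. Everything here is an assumption made for illustration: the function names are invented, the suffix tuple contains only the two endings named in the message, the base figure of 87,500 word forms is merely what 175,000,000 implies under linear scaling from 5 million to 10 billion words, and the exponent 0.5 is an arbitrary stand-in for a sublinear (Heaps'-law-style) growth rate, not a measured value.

```python
# Only the two endings named in the message; the "etc." is left open.
ROMANCE_SUFFIXES = ("tion", "nce")


def winnow(words, suffixes=ROMANCE_SUFFIXES):
    """Drop words ending in any of the given suffixes (a crude heuristic)."""
    return [w for w in words if not w.lower().endswith(suffixes)]


def linear_types(base_types, base_tokens, target_tokens):
    """Scale the number of distinct forms in proportion to corpus size."""
    return base_types * target_tokens / base_tokens


def sublinear_types(base_types, base_tokens, target_tokens, beta):
    """Heaps'-law-style scaling: distinct forms grow as (corpus size) ** beta."""
    return base_types * (target_tokens / base_tokens) ** beta


BASE_TOKENS = 5_000_000          # size of the 1971 corpus, per the message
TARGET_TOKENS = 10_000_000_000   # upper end of the 5-10 billion range
IMPLIED_BASE_TYPES = 87_500      # assumption: 175,000,000 / 2,000

print(winnow(["nation", "chocolate", "presence", "tea"]))
print(linear_types(IMPLIED_BASE_TYPES, BASE_TOKENS, TARGET_TOKENS))
print(round(sublinear_types(IMPLIED_BASE_TYPES, BASE_TOKENS, TARGET_TOKENS, beta=0.5)))
```

The linear estimate reproduces the 175,000,000 figure, while any exponent below 1 gives a far smaller number, in line with the caveat that the real count 'will be much less'. A suffix test of this kind will also misfire on some words (it would discard 'once', for example), so it only shortens the list for manual checking.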
>
> Jim
>
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Benemérita Universidad Autónoma de Puebla, MÉXICO
>
>
>
> On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller
> <miller at ukp.informatik.tu-darmstadt.de> wrote:
>
> Dear all,
>
> I'm interested in finding the most frequent words in English which do
> not have an origin in any Romance or Germanic language. Does anyone
> know if such a list is available anywhere?
>
> If not, I suppose I could produce one myself easily enough by taking a
> raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
> querying each entry in a machine-readable dictionary which provides
> etymological information, and filtering appropriately. But that
> presupposes that such a dictionary exists. Does anyone know of a
> suitable freely available dictionary for this task? Since I'd need to
> automatically query many thousands of words, I'd want something that I
> can download for offline use and access through an API. I could try
> accessing an offline dump of Wiktionary using the JWKTL API, though I
> suspect Wiktionary's etymological coverage is too spotty.
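
The pipeline described above (frequency list, etymology lookup, filter) might look roughly like this. It is a minimal sketch, not a real dictionary client: `rank_outside_families` is a hypothetical name, the etymology source is stood in for by a plain dict, and the counts and family labels are invented toy data. Words missing from the lookup are reported separately, since coverage is exactly the open question.

```python
EXCLUDED = frozenset({"Romance", "Germanic"})


def rank_outside_families(freqs, family_of, excluded=EXCLUDED, top_n=10):
    """Return (ranked, unknown).

    ranked  -- (word, count) pairs whose family is known and not excluded,
               most frequent first
    unknown -- words with no etymology entry, sorted alphabetically
    """
    ranked, unknown = [], []
    for word, count in freqs.items():
        family = family_of.get(word)
        if family is None:
            unknown.append(word)
        elif family not in excluded:
            ranked.append((word, count))
    # Sort by descending count, breaking ties alphabetically.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n], sorted(unknown)


# Toy data: counts are invented, family labels are illustrative only.
freqs = {"the": 1000, "nation": 300, "tea": 90, "chocolate": 40, "blorp": 1}
family_of = {
    "the": "Germanic",
    "nation": "Romance",
    "tea": "Sinitic",
    "chocolate": "Uto-Aztecan",
}

ranked, unknown = rank_outside_families(freqs, family_of)
print(ranked)   # [('tea', 90), ('chocolate', 40)]
print(unknown)  # ['blorp']
```

In practice `freqs` would be loaded from a lemma frequency list and `family_of` built from whatever etymological resource turns out to be usable; the size of `unknown` then gives a direct measure of how spotty that resource is.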
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Research Scientist
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
>
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
> _________________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora