[Corpora-List] Most common non-Romance, non-Germanic words in English

Jim Fidelholtz fidelholtz at gmail.com
Thu Apr 10 03:29:50 CEST 2014

Hi, Tristan,

One other suggestion, based on some research I did about 4 decades ago: the dividing line for English between what I call 'familiar' (frequent) words and 'unfamiliar' (infrequent) words seems to be about 5 occurrences per million words of text, on average. Sometimes, for some words, this frequency may drop as low as 1 per million (1/M). I used Thorndike & Lorge (1945 or so) for frequency counts.
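
To make the threshold concrete, here is a minimal sketch of how a raw count could be normalized to occurrences per million and compared with the roughly 5/M dividing line. The function names and the default cutoff parameter are my own illustrative choices, not anything from Thorndike & Lorge, and the 5/M figure is only the approximate average mentioned above.

```python
# Hypothetical helper names; the 5-per-million cutoff is the approximate
# dividing line suggested in the message, not a fixed standard.
FAMILIAR_PER_MILLION = 5.0


def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def is_familiar(count, corpus_size, threshold=FAMILIAR_PER_MILLION):
    """True if the word is at or above the per-million threshold."""
    return per_million(count, corpus_size) >= threshold
```

Passing `threshold=1.0` would model the lower bound of 1/M that seems to apply to some words.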


James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Wed, Apr 9, 2014 at 8:22 PM, Jim Fidelholtz <fidelholtz at gmail.com> wrote:

> Hi, Tristan,
> OK on the source language, although it might help if we knew what the purpose of
> your list would be. I didn't have any illusions that you wanted *all* such
> words, although you can see from other comments that there is a large
> number of 'other' languages to pick your loanwords from, and I still
> haven't seen any (even approximate) *number* of such borrowings--but you
> would have to take (starting with the most frequent words) a rather large
> list of words to even begin processing--using the various methods you and
> others have suggested for winnowing down the list.
> Unless, for example, you are interested in doing a similar study for
> borrowings into Spanish, I don't think you can really find *any* English
> borrowings originally from Nauatl that did not come via Spanish originally,
> and this would eliminate really any words from this language, a result
> which I would find unfortunate for any research I can imagine on borrowings
> into English from (various) languages, even eliminating the ones you want
> to eliminate. Btw, another problem you will find in this regard is
> determining reasonably, for less frequent (but still among the 'more
> frequent', depending on how you define this) words, exactly from what
> language each was taken. Indeed, there are often clues in the phonological
> development in English of the word as to what language it must have come
> from (i.e., via) originally, but at least sometimes, for Nauatl, some
> speakers or varieties of English may have been in contact with speakers of
> Nauatl, and the word may have been borrowed independently from more than
> one language and/or at different times. Multilingualism is very complex, on
> the one hand, and etymologists are known to commit (and perpetrate) errors,
> on the other hand. This, of course, includes folk etymology, which is
> rampant in, e.g., place names, among other things.
> Also btw, the point of starting with an extremely large corpus (well over
> a billion words) would be to try to minimize the effect which tends to
> scramble words on the frequency list below, say, the first thousand [note:
> even below about 100 you will find pretty large variation in positions of
> words in the list each and every time you redo a count with new (even
> comparable) data selected], by a fairly large number of positions (this is
> why Carroll et al. factored in very importantly their measure of genre
> distribution, so that among the very *last* words listed by frequency
> (after all but a few tens of real hapax words, among the several tens of
> thousands of hapax) are a few words which occur 2 or more times in the
> whole corpus, but only in the genre [religion] with the fewest selections
> taken in forming the corpus).
> I'd be interested to hear more about your project, in any case.
> Jim
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Benemérita Universidad Autónoma de Puebla, MÉXICO
> On Wed, Apr 9, 2014 at 11:30 AM, Tristan Miller <
> miller at ukp.informatik.tu-darmstadt.de> wrote:
>> Dear Jim,
>> Thanks for your insightful remarks. To address a few matters:
>> On 08/04/14 11:09 PM, Jim Fidelholtz wrote:
>> > Nevertheless, you
>> > specify that they do not have 'AN origin in any Romance or Germanic
>> > language' (my emphasis), which is vague, to say the least. An example:
>> > 'chocolate', which according to me surely would find its way into your
>> > list, or should, comes from Nauatl, according to the online etymological
>> > dictionary via Spanish and other European languages. In English, by
>> > default we assume any borrowed word (especially food-related ones) would
>> > come from French, which I believe is true in this case. So 'chocolate'
>> > clearly has 'an' origin in French (Romance), although its ultimate
>> > origin is Nauatl and it should therefore be in your list.
>> I don't think this part of my phrasing was vague, as you seem to have
>> interpreted it correctly. Yes, I mean to exclude words like "chocolate"
>> which arrived in English via French. If I had wanted them in my list, I
>> might have written something like "words whose earliest post-PIE origin
>> cannot be traced to a Germanic or Romance language".
>> > Likewise, you specify that you wish to find 'the most frequent [such]
>> > words in English' which is also worse than vague, since theoretically
>> > (George Bedell, MIT PhD thesis ca. 1969:
>> > nationalizationalizationalize... -- and apparently practically as well:
>> > see Baayen 2001) there are an infinite number of words in English; thus,
>> > unless you specify a specific number of the most frequent non-Germanic
>> > non-Romance English words, there will be an infinite number of them as
>> > well (if you think I'm wrong in my count, just wait a few millennia!).
>> Well, I thought it would have gone without saying that I didn't want
>> *all* such words -- after all, I made reference in my message to using
>> existing corpora, which must be of finite size. :)
>> > If you are going to put an upper limit (say N) on the number of such
>> > words, as a practical matter your quest should not be so difficult. Find
>> > any huge list of English words (alternative: take the largest English
>> > corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
>> > site of Mark Davies), make up a list in frequency order, most frequent
>> > first, and then eliminate all the obviously Romance words [any word
>> > ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones
>> > in a similar way. Using prefixes from medical dictionaries, eliminate
>> > all the words using them (almost all are from Greek or Latin; Latin is
>> > Romance; the Greek ones almost all entered via Latin (medieval
>> > university Latin or 'modern' Latin)). This will at least shorten your
>> > list a great deal. You should get an original corpus from Davies of
>> > somewhere between 5 and 10 billion [American sense] words. This might
>> > give you as many as 175,000,000 distinct word forms, based on figures
>> > from a 5 million-word corpus (The American Heritage corpus, 1971),
>> > before your winnowing.
>> I think the more important limiting factor for the list is not the
>> number of words in the corpus, but rather the number of words in the
>> etymological MRD. That is, assuming the frequency counts are already
>> available, there's no need to heuristically exclude Romance and Germanic
>> words (and indeed, I don't think I'd want to, as in my experience you
>> get too many false positives). In the first instance we can simply
>> filter out all words which don't appear in the dictionary, and then look
>> up the remainder. In the worst case this will involve looking up every
>> word in the dictionary once, which, if done automatically, can't take
>> more than a few hours or days of computing time. The problem is finding
>> such a dictionary and an API therefor. I've got an offline copy of the
>> OED2, though I don't know if it's possible to query via API, or how easy
>> it would be to parse the etymological information.
>> Regards,
>> Tristan
>> --
>> Tristan Miller, Research Scientist
>> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
>> Department of Computer Science, Technische Universität Darmstadt
>> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
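
As a rough illustration of the dictionary-first filtering Tristan describes in the quoted message above (skip words absent from the etymological dictionary, look up the rest, and exclude anything with a Romance or Germanic stage in its history), a minimal sketch might look like the following. Everything here is an assumption made for illustration: the function name, the representation of an etymology as a list of language families, and the plain dict standing in for a machine-readable dictionary. It is not an interface to the OED2 or to any real dictionary API.

```python
# Hypothetical data model: each dictionary entry maps a word to the list of
# language families through which it passed on its way into English.
EXCLUDED_FAMILIES = {"Romance", "Germanic"}


def non_romance_germanic(freq, etymologies, top_n=None):
    """Return (word, count) pairs, most frequent first, for words whose
    recorded etymology has no Romance or Germanic stage.

    freq        -- dict mapping word -> corpus count
    etymologies -- dict mapping word -> list of language families
    top_n       -- optional cap on the number of results
    """
    kept = []
    for word, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
        chain = etymologies.get(word)
        if chain is None:
            continue  # not in the dictionary: filtered out in the first pass
        if EXCLUDED_FAMILIES.intersection(chain):
            continue  # has 'an' origin in an excluded family
        kept.append((word, count))
    return kept if top_n is None else kept[:top_n]
```

Under this reading, a word like 'chocolate' would be dropped as soon as its entry lists a Romance intermediary, which matches the exclusion Tristan says he intends. The sketch needs one lookup per corpus word that appears in the dictionary, so the size of the dictionary, not of the corpus, bounds the expensive step.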