[Corpora-List] The story so far on Ngram

John Mckenny john.mckenny at unn.ac.uk
Mon Mar 14 10:27:01 CET 2005


Dear Corpusians

I summarize because some opinions may not have been posted to the list.
Apologies for not having time to make this shorter and to Pascal for
stealing such a good line.
A combination of empirical and rationalist approaches points to
N-gram/n-gram (with a hyphen) being more acceptable and more used than the
hyphenless ngram/Ngram (personally ngram looks neater to me but I won't even
think about it in the future).
John Sowa writes
<I treat a variable such as N or a number such as 4 in the same way I would
treat a word. Therefore, I would apply the same rules for inserting a hyphen
between two words. If the variable is N, I would write N-gram. But if the
variable were x, y, or n, I would write x-gram, y-gram, or n-gram. And by
the same rule, I would write 4-gram.>

Harold Somers writes: <As editor of a journal which often has articles that
mention n-grams, my house style is to have n-gram with a hyphen, and the n
in italics. Although I feel it is not quite right, I guess I would
capitalize the n if it starts a sentence. As for nomenclature, it seems to
me that we hear about unigrams, bigrams and trigrams, but after that use
numbers: 4-grams, 5-grams etc., with a numeral and a hyphen. That's my
preference, based on what I have seen or heard>.

Noah Smith writes: <Not sure on hyphenation, but in my view the "N" or "n"
is an algebraic variable and should be in italics/math typeface. There's a
paper by Kneser and Ney in which they actually call them "m-grams"! "N" or
"n" is arguably just the conventional choice of the variable's name, like
lambda and mu for Lagrangean multipliers, alpha for interpolation
coefficients, etc. As for higher-order -grams, some tend to avoid the
vocabulary question by referring to (for example) 4-gram models as
third-order Markov models (generally a p-gram model is a (p-1)th-order
Markov model). If you get an empirical result that supports a consensus,
maybe we won't have to resort to this workaround!>
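
Noah's remark about model order can be made concrete with a small sketch. This is not from the thread: it is a minimal Python illustration, and the helper name `ngrams` and the sample sentence are assumptions chosen for the example. It shows that each 4-gram consists of a 3-token history plus the token being predicted, which is the sense in which a 4-gram model is a third-order Markov model.

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token sequences (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sample text, not taken from the thread.
tokens = "the cat sat on the mat".split()
four_grams = ngrams(tokens, 4)

# Each 4-gram splits into a history of n - 1 = 3 tokens plus the
# predicted token, hence "third-order" for a 4-gram model.
history, predicted = four_grams[0][:-1], four_grams[0][-1]
print(four_grams)
print(len(history), predicted)
```

The same helper yields unigrams, bigrams and trigrams for n = 1, 2, 3, so the naming question above is independent of how the sequences are extracted.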
Chris Brew writes:
<The sequence could have been monogram, digram, trigram, tetragram,
pentagram, hexagram, ...with fairly uniform (Greek) etymology, but someone
chose unigram, bigram, trigram, ... these look like Latin numerical prefixes, so
my guess is that the intended extrapolation is
quadrigram, quintagram, ... which replicates the mixed Latin/Greek etymology
of bigram through the series. Pretty yukky...>
Geoffrey Sampson writes
<Well if it's pentagrams and hexagrams it surely should be tetragrams rather
than "quadrigrams", in order to avoid mixing Latin and Greek.
But then if you want to avoid pentagram because of Satanism, you might
equally want to avoid tetragram because it might be taken to refer to the
unspeakable Hebrew four-letter name of God. You can't win!
I think most people would write 4-gram, 5-gram etc after "trigram", and
whether you capitalize the N of N-gram must surely be a matter of taste
only. (Though missing out the hyphen would be confusing, I'd have
thought.)>

On the more empirical side, Damon Allen Davison writes <My corpus was a page
of Google results limited to 100 for the search term "n gram". Doing both
"ngram" and "n gram" was slightly problematic because there is a Perl CPAN
module called Text::Ngram, so that weights the results for "ngram" quite a
bit.

n-gram : 128 times
N-gram : 126 times
ngram : 57 times
N-Gram : 34 times
Ngram : 10 times
N-GRAM : 9 times
NGRAM : 8 times
n-Gram : 7 times
NGram : 5 times
I did this using this Perl script after doing "links --dump
results.html > results.txt" to the results file I had saved.
#!/usr/bin/perl
# syntax: findword <filename>
use warnings;
use strict;
my %total;
my @matches;
while ( <> ) {
    # case-insensitive, case-preserving matching, dash optional
    @matches = /(n-?gram)/i;
    $total{$_}++ foreach @matches;
}
print map { "$_ : $total{$_} times\n" }
    reverse sort { $total{$a} <=> $total{$b} } keys %total;
Anyway, I hope that helps a little. You can use the same script to do
searches on other files. :)
I like to use "n-gram">.
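
One detail of the script is worth noting: the match is made without Perl's /g flag, so in list context it returns at most one match per input line, and a line containing several variants is counted once. The following is a minimal Python sketch of a variant that counts every occurrence. It is not Damon's code: the helper name `count_variants` and the sample string are assumptions made for illustration, and only the pattern (optional hyphen, case-insensitive, capitalization preserved) is carried over from the script.

```python
import re
from collections import Counter

# Same pattern as the Perl script: optional hyphen, case-insensitive match,
# with the original capitalization preserved in the counted string.
PATTERN = re.compile(r"n-?gram", re.IGNORECASE)

def count_variants(text: str) -> Counter:
    """Count every occurrence of each spelling variant (hypothetical helper)."""
    # findall returns all non-overlapping matches, not just the first per line.
    return Counter(PATTERN.findall(text))

# Illustrative sample text, not taken from the Google results.
sample = "An n-gram model; N-gram counts. The ngram module and n-gram again."
counts = count_variants(sample)
for variant, total in counts.most_common():
    print(f"{variant} : {total} times")
```

Like the original, this pattern has no word boundary, so it would also match inside longer strings such as "Text::Ngram".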
John F. Sowa replies:
<Damon Davison's use of Google inspired me to try
a variation. I just typed three queries and
got the following number of hits:

Search string    Hits
-------------    ------
ngram            21,100
ngram not perl      540
n-gram           85,700

This seems to provide overwhelming evidence for
a hyphen between "n" and "gram". Since Google
doesn't distinguish capitals, that leaves the
capitalization question unresolved.>

But Stefan Evert then urges caution: <you do not realise that "ngram not
perl" found approx. 540 pages that
contain all three words ("ngram", "not" and "perl"), don't you?
You can see this quite clearly when you look at the result page where
the matching keywords are highlighted.>

Yannick Versley elaborated: <Asking google for n-gram may not do what you
intended, since your query will match all of ngram, n-gram and n gram. Even
then, looking for "n gram" (which will match n-gram and n gram) returns
68.900 hits, so n-gram is probably still the right one.
What I got from Google:
search str.       hits
--------------    ---------
ngram             20 400
ngram -perl       16 100
"n gram"          68 500
"n gram" -perl    63 100>

Andrew Kehoe advises:
<You need to use the search term "ngram -perl" rather than "ngram not perl"
because, as Stefan Evert pointed out, "ngram not perl" just returns pages
containing all 3 of those words.

Another problem with your method is that Google ignores hyphens in search
terms. One of the pages returned for the term "n-gram" is
http://cpan.dei.uc.pt/authors/id/J/JH/JHI/ngram.pl-1.48&e=8092 but this
page does not contain the word "n-gram" at all, only "ngram" without the
hyphen.>
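
If the search engine cannot separate the hyphenated, spaced and closed forms, one alternative is to count them in locally saved text, where the pattern is under our control. The sketch below is a minimal Python illustration under stated assumptions: the names `FORM`, `LABELS` and `count_forms` and the sample sentence are hypothetical, and it says nothing about how Google itself tokenizes queries.

```python
import re
from collections import Counter

# The captured group records what sits between "n" and "gram".
# The leading \b stops matches inside longer words such as "engram".
FORM = re.compile(r"\bn([- ]?)gram", re.IGNORECASE)
LABELS = {"-": "hyphenated", " ": "spaced", "": "closed"}

def count_forms(text: str) -> Counter:
    """Count hyphenated, spaced and closed forms separately (hypothetical helper)."""
    return Counter(LABELS[m.group(1)] for m in FORM.finditer(text))

# Illustrative sample text, not taken from any search result.
sample = "The n-gram and N-Gram forms, plus ngram, n gram and Text::Ngram."
counts = count_forms(sample)
for label, total in counts.most_common():
    print(f"{label} : {total}")
```

Note that "Text::Ngram" is still counted as a closed form here, so module names would need to be filtered separately, as the "-perl" queries above attempt to do.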

It looks like the searchers will come up with, or tell us how to come up
with, reliable frequency counts. If so, and always bearing in mind the GIGO
principle, I wonder whether Noah Smith is right when he surmises above: <If you get
an empirical result that supports a consensus, maybe we won't have to resort
to this workaround>. The issues of capitalization and italicization might
be measurable. Nonetheless I suspect that editors and writers in the cluster
of discourse communities on CORPORA (see Harold Somers above) will continue
with their current usage unless shown overwhelming counterevidence.
Best wishes
John McKenny
PS FINAL APPEAL: could you please send me your MWU/formulaic sequence/
chunking answer by 17 March, St. Patrick's Day. Thanks.



