# [Corpora-List] Significance test for TTR

Benjamin Allison ballison at staffmail.ed.ac.uk
Mon Nov 21 11:20:50 CET 2011

Chris,

I will assume you have good reason for asking about vocabulary richness measures, mindful of the fact (as David and others point out) that vocabulary richness might not be all that useful, and concentrate on the technical question.

TTR is a bad idea on at least two counts, one practical and the other technical. The practical concern is that it's highly dependent on text length, as others point out, so you'd have to use samples of the same size, which would mean throwing out some data (always a bad place to start with statistics!). The technical concern is that there's no reason to believe it follows any particular distribution, so you'd end up using an off-the-shelf test (like the t-test) whose assumptions wouldn't be appropriate.
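To make the length dependence concrete, here is a minimal sketch that computes TTR on growing prefixes of one synthetic text. The text is simulated, not real corpus data: the vocabulary size, the 1/rank (Zipf-like) weights, and the sample sizes are all assumptions chosen for illustration.

```python
import random

def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

# Synthetic "text": 20,000 tokens drawn from a Zipf-like distribution over a
# 5,000-word vocabulary. All of these numbers are illustrative assumptions.
rng = random.Random(0)
vocab = [f"w_{rank}" for rank in range(1, 5001)]
weights = [1.0 / rank for rank in range(1, 5001)]
text = rng.choices(vocab, weights=weights, k=20000)

# Same source, same "author", yet TTR falls as the sample gets longer.
for n in (100, 1000, 10000, 20000):
    print(n, round(ttr(text[:n]), 3))
```

Since every prefix comes from the same underlying distribution, any drop in TTR here is due to sample length alone.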

A better measure, stable across text lengths in my experience, is the proportion of word token pairs in the text which are the same word, i.e. if your corpus were:

w_1 w_2 w_3 w_1 w_2

there are 10 pairs:

w_1 w_2
w_1 w_3
w_1 w_1 *
w_1 w_2
w_2 w_3
w_2 w_1
w_2 w_2 *
w_3 w_1
w_3 w_2
w_1 w_2

of which two (marked *) are the same word. This proportion can be viewed as a binomial parameter, where the ML estimate in this case would be 2/10, and so if you have two corpora you wish to compare, you can use a test for comparing binomial parameters.
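A minimal sketch of this measure follows. Rather than enumerating every pair, it uses the fact that a word occurring c times contributes c(c-1)/2 same-word pairs; the function names are illustrative, not from any library.

```python
from collections import Counter

def repeat_pair_counts(tokens):
    """Return (same-word pairs, total pairs) over all unordered token pairs."""
    counts = Counter(tokens)
    n = len(tokens)
    # A word seen c times yields c * (c - 1) / 2 pairs of identical tokens.
    same = sum(c * (c - 1) // 2 for c in counts.values())
    total = n * (n - 1) // 2
    return same, total

def repeat_rate(tokens):
    """ML estimate of the proportion of token pairs that are the same word."""
    same, total = repeat_pair_counts(tokens)
    return same / total

corpus = "w_1 w_2 w_3 w_1 w_2".split()
print(repeat_pair_counts(corpus))  # (2, 10)
print(repeat_rate(corpus))         # 0.2
```

Counting by word frequency keeps the cost linear in the number of tokens, which matters because the number of pairs grows quadratically with corpus size.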

There are lots of ways to go about this, but most people will suggest using a normal approximation to the binomial and then testing for different means. Beware: if the sample sizes are very different, the assumption of equal variance will not hold. There are methods for testing binomial populations directly, but they're a bit more involved.
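As a rough sketch of the normal-approximation route, the function below compares two proportions using separate (unpooled) variance estimates, so it does not lean on the equal-variance assumption. The counts in the usage lines are hypothetical, and the test treats the pairs as independent trials, which is itself an approximation here.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for p1 == p2 under a normal approximation.

    x1, x2: number of "successes" (e.g. same-word pairs) in each sample
    n1, n2: number of trials (e.g. total pairs) in each sample
    Uses unpooled variances; assumes independent trials, which is only
    approximately true for token pairs taken from one text.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(60, 1000, 30, 1000)
print(z, p_value)
```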

If you're tempted to go down the road of assuming normality, I'd suggest applying the logistic (logit) transform to the parameters first (since your estimated parameters will be close to 0, which is where normal approximations break down), and if I recall correctly the approximation gets better still if you take the difference of your (transformed) parameters and test whether its mean is significantly different from zero.
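One way to sketch the transformed version is to take the difference of the two logits and divide by a delta-method standard error, where the variance of each estimated logit is approximated by 1/x + 1/(n - x). That standard error is a standard large-sample approximation chosen for this illustration rather than something spelled out above, and the counts are again hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

def logit_difference_z(x1, n1, x2, n2):
    """Two-sided z-test that logit(p1) - logit(p2) is zero.

    Requires 0 < x < n in both samples. The variance of each estimated
    logit is approximated by 1/x + 1/(n - x) (delta method, assumed here).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = logit(p1) - logit(p2)
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value

# Hypothetical counts, for illustration only.
diff, z, p_value = logit_difference_z(60, 1000, 30, 1000)
print(diff, z, p_value)
```

On the logit scale the statistic is unbounded, so a symmetric normal approximation is less strained for proportions near 0 than it is on the raw scale.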

One final word of caution: most tests assume independent samples, and that will not hold in this case. Here's why. If you observe in one sample a value of, say, 0.00001, this tells you something about the likely distribution of the statistic in the other sample (for example, 0.99999 would be a pretty unlikely outcome...). A more realistic model would be one where the values of the statistic in both samples are drawn from some common, underlying prior, but that's probably going to get quite involved and I'm not sure how accurate you want the test to be. In any case, the effect that this will have (just so you're aware) is that you may judge your two samples to be from the same population when in fact they're different, because the range of plausible values is far narrower than the test allows.

Hope that helps.

B

Quoting "David L. Hoover" <david.hoover at nyu.edu> on Sun, 20 Nov 2011 22:43:54 -0500:

> Dear Chris,
>
> George has given a good explanation of some of the problems. A much
> more severe problem is that lexical diversity/vocabulary richness is
> simply not a very reliable statistic for differentiating
> texts/authors. Although Tweedie and Baayen conclude that it can be
> used with caution, my own research has shown that lexical diversity
> shows extreme fluctuation within the works of a single author and
> even between different sections of the same text. Perhaps there
> might be a more systematic and reliable difference between text
> types than between authors or texts, but lexical diversity is so
> variable that even this doesn't seem very likely. For more detail,
> see my
> “Another Perspective on Vocabulary Richness.” Computers and the
> Humanities, 37(2), 2003: 151-78.
>
> Best,
> David Hoover
>
> On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>>
>> Dear Chris,
>>
>> First things first. TTR is highly dependent on text length so you
>> have to be sure that the measurements have been taken from equal
>> size text samples. Otherwise you should use a more robust index
>> such as Yule’s K or Zipf’s Z (see [1] for a detailed
>> description of this problem). Now coming to your original question,
>> TTR is a continuous variable and you could use the whole range of
>> parametric statistics. This means that you can use a t-test if you
>> want to check whether TTR is significantly different across two
>> classes (e.g. Gender distinction in essays), or ANOVA if your
>> independent variable has many classes (e.g. Text Genre, Text Topic
>> etc). You can also implement a linear regression model with
>> dependent variable TTR and independent variables the ones that
>> describe your research hypothesis. In all the above cases you need
>> multiple TTR measurements because inferential statistics are based
>> on the distribution parameters of the TTR. There is also the option
>> to compare a single TTR value to a distribution of TTR values using
>> one-sample location test (also called Z test) which actually can
>> tell you how the specific TTR value lies away from the mean of the
>> TTRs.
>>
>> If the only thing you know are just 2 TTR values I don’t think you
>> can compare them in any meaningful way.
>>
>> Best
>>
>> George Mikros
>>
>> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may
>> a constant be? Measures of lexical richness in perspective.
>> Computers and the Humanities, 32(5), 323-352.
>>
>> ____________________________
>>
>> George K. Mikros
>>
>> Associate Professor of Computational and Quantitative Linguistics
>>
>> Department of Italian Language and Literature
>>
>> School of Philosophy
>>
>> National and Kapodistrian University of Athens
>>
>> Panepistimioupoli Zografou, GR-15784
>>
>> Athens, Greece
>>
>> Tel: +30 210 7277491, +30 6976111742
>>
>> Email: gmikros at isll.uoa.gr
>>
>> Web: http://users.uoa.gr/~gmikros/
>>
>> *From:*corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On
>> Behalf Of *CRuehlemann at aol.com
>> *Sent:* Sunday, November 20, 2011 7:21 PM
>> *To:* CORPORA at uib.no
>> *Subject:* [Corpora-List] Significance test for TTR
>>
>> Hi all,
>>
>> The type token ratio (TTR) is a measure of the lexical diversity of
>> a text/text type. If one finds in two texts/text types two widely
>> differing TTRs, one would like to assess the significance of this
>> finding.
>>
>> Which test is appropriate for differences between TTRs?
>>
>> Best
>>
>> Chris
>>
>>
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora