Working with a large SMS corpus, I ran into the same situation you describe. Kilgarriff's (2005) article is very instructive and helped me understand the problem better.
I also recommend reading:
Grissom, R. J. and Kim, J. J. (2005). Effect Sizes for Research: A Broad Practical Approach. Mahwah (N.J.): Lawrence Erlbaum Associates.
They propose using measures of association (such as correlation coefficients) rather than a significance test such as chi-squared when working with large amounts of data (as is now often the case in corpus linguistics). Association measures tell you more about the size of the effect between your variables: with a lot of data, you can get a highly significant X² that corresponds to only a weak association.
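To make this concrete, here is a minimal sketch in R, to be read as an illustration under assumptions rather than a recommended analysis. The matrix holds only four of the genres from the table quoted below (ART, BKS, CLM, LTR, chosen because both of their counts are legible). The object name cramers_v is mine. Cramér's V is one common association measure for contingency tables, and not necessarily the one Grissom and Kim would pick for this design.

```r
# Illustrative subset of the genre table quoted below:
# rows are genres, columns are the two ditransitive patterns.
x <- matrix(
  c(210, 344,   # ART
    335, 308,   # BKS
    108, 303,   # CLM
     35,  92),  # LTR
  ncol = 2, byrow = TRUE,
  dimnames = list(genre   = c("ART", "BKS", "CLM", "LTR"),
                  pattern = c("double_object", "to_dative"))
)

# Pearson chi-squared test, same call as in the original message.
cxx <- chisq.test(x, correct = FALSE)

# Cramér's V = sqrt(X^2 / (N * (min(rows, cols) - 1))); it lies between 0 and 1.
n <- sum(x)
cramers_v <- sqrt(unname(cxx$statistic) / (n * (min(dim(x)) - 1)))

cxx$p.value   # significance: is there evidence of any association at all?
cramers_v     # effect size: how strong is that association?
```

Reading the two numbers side by side is the point: with thousands of observations the p-value can be tiny even when V shows that genre explains little of the choice between the two patterns.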
> Muhammad Shakir Aziz,
> the null hypothesis-testing you discuss here doesn't work in corpus
> linguistics - for the argument see
> Language is never ever ever random
> 2005 *Corpus Linguistics and Linguistic Theory* 1 (2): 263-276.
> My rule of thumb is: it only counts if the ratio (of normalised
> is greater than/less than a factor of two between two text types
> On 28 June 2010 05:25, True Friend <true.friend2004 at gmail.com> wrote:
>> Good Day to All Corpora Members
>> I am a masters in applied linguistics student, currently working on my
>> thesis. The topic of research is the use of ditransitive constructions.
>> authenticate the results I want to apply statistical techniques on the
>> research. For example I am trying to see whether there is a significant
>> difference in the usage of two alternative ditransitive patterns in PWE
>> (Pakistani Written English, the corpus I am working on for the
>> The alternative ditransitive patterns here mean Double Object (He gave
>> me a
>> pen) and To Dative (He gave a pen to me). I am pasting the table here,
>> contains genre names and frequencies of all verbs (used ditransitively)
>> that genre.
>> Genre  D. Object  To Dative
>> ALT          0          4
>> ART        210        344
>> BKS        335        308
>> BLT          2          7
>> BRU          4          2
>> CLM        108        303
>> CST          0          7
>> DIR          1          7
>> EDT          8         32
>> FTW         23         14
>> INT         38         44
>> LDS          7         53
>> LTR         35         92
>> MGP          2          5
>> MNF          3          6
>> MNU          0          1
>> NLT          7         23
>> NVL          5          3
>> NWS          ?        108
>> OLT         44          9
>> PLC          0          1
>> PRS         11         22
>> RPR         19         60
>> RPT          4         17
>> SRY          0          7
>> STR         76          ?
>> THS         20         36
>> TRN         30         19
>> WWW         16         30
>> (? = value lost in the archived copy of this message)
>> Some facts about the data are as
>> Genre are not of equal in length (number of words) so there may be a
>> like ALT of a few hundred words, and another like ART of .5 million
>> Frequencies here are calculated by adding the occurrences of all the
>> occurred in the given genre in a given pattern.
>> I have applied Chi Square test using R and with this command "cxx =
>> chisq.test(x, correct = FALSE)" (while 'x' and 'cxx' are R objects) and
>> result was as follows.
>> Pearson's Chi-squared test
>> data: x
>> X-squared = 268.2688, df = 28, p-value < 2.2e-16
>> Going through the help manuals of R, I came to know that p-value
>> is a too much small number, so it means that the difference between the
>> variables (Double Object and To Dative) is significant, as p-value for
>> social sciences is considered p<0.005. Please correct me if I am
>> misunderstanding the test, its results or applying it incorrectly. And
>> this test is not suitable for such kind of analysis, and alternatively
>> kind of test should I apply. And last one last thing, I applied the test
>> normalized frequencies (which were calculated by dividing the frequency
>> each genre with the number of words it has, and the multiplying it with
>> 100,000 i.e. .1 million) but the chisquare result was same (same
>> Any help and comments would be highly appreciated.
>> Best Regards
>> Muhammad Shakir Aziz محمد شاکر عزیز
>> Masters in Applied Linguistics (last semester student)
>> Translator, Course Developer, Linguist for Urdu, Punjabi and English
>> Urdu:- http://awaz-e-dost.blogspot.com/
>> English:- http://linguisticslearner.blogspot.com/
>> Facebook:- http://www.facebook.com/truefriend2004
>> Skype:- true_friend2004
>> Corpora mailing list
>> Corpora at uib.no
> Adam Kilgarriff
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com