[Corpora-List] Comparing n-grams / authorship

Khalid CHOUKRI choukri at elda.org
Wed Apr 18 10:05:58 CEST 2012


Hi Mark

I am sure you are aware of the work conducted within the CLEF project (Cross-Language Evaluation Forum). A large part of the evaluation is about authorship and addresses it from various angles (plagiarism, "vandalism", authorship identification/attribution, etc.). See details of the PAN lab at: http://clef2011.org/index.php?page=pages/labs_program.html

Hope this helps

Khalid

Justin Washtell wrote, On 17/04/2012 23:17:
> Hi Mark,
>
> Statistics such as log-likelihood (see http://ucrel.lancs.ac.uk/llwizard.html) can give an indication of how significant the differences in observed frequencies of events are.
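A minimal sketch of the log-likelihood calculation mentioned above, assuming the usual two-corpus form in which expected frequencies are derived from the pooled relative frequency. The function name and the counts in the usage note are hypothetical; only the chapter sizes (1,200 and 2,400 words) come from the original question.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood (G2) for one item observed in two texts.

    freq_a, freq_b: observed counts of the item in text A and text B.
    size_a, size_b: total number of words in text A and text B.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the item were equally likely in both texts.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll
```

For example, `log_likelihood(5, 0, 1200, 2400)` scores a hypothetical item seen five times in chapter A and never in chapter B. Larger values mean a bigger departure from the null hypothesis; the cautions about that null hypothesis apply to any such score.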
>
> These sorts of statistics assume a null hypothesis in which everything is entirely random or unrelated, and anything outside of that is considered "significant". You need to be careful with this. Often in reality, as in your case I think, what you are looking for is actually more subtle.
>
> For example, I would suggest that you will at least want to look at similar n-gram statistics derived from all other pairwise combinations of chapters in your particular corpus, to establish whether what is observed between A and B is somehow "special" in your case.
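The pairwise baseline suggested above could be sketched as follows. This is only an illustration under stated assumptions: the book is already split into tokenised chapters, n-grams crossing chapter boundaries are ignored, and the function names are hypothetical.

```python
from itertools import combinations


def ngram_set(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_shared_counts(chapters, n):
    """Count, for every pair of chapters, the n-grams that both chapters
    contain and that occur in no other chapter.

    chapters: dict mapping a chapter name to its list of tokens.
    Returns a dict mapping (name1, name2) to that count.
    """
    sets = {name: ngram_set(tokens, n) for name, tokens in chapters.items()}
    counts = {}
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        # Discard anything that also appears elsewhere in the book.
        for other, grams in sets.items():
            if other not in (a, b):
                shared = shared - grams
        counts[(a, b)] = len(shared)
    return counts
```

Running this for each n of interest gives a distribution of counts over all chapter pairs, against which the count for the pair (A, B) can be compared. Chapter length affects these counts, so some normalisation would probably be needed before drawing conclusions.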
>
> Also, I imagine the observed frequencies of those lower order n-grams which constitute your longer n-grams will have a bearing on how remarkable the figures are before you even start looking at the relative differences. For getting a handle on that, the language modelling literature may be useful.
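One simple way to get a handle on this, borrowed from basic language modelling, is to estimate how probable a long n-gram is from the frequencies of its parts. The sketch below assumes a bigram model with add-one smoothing estimated from the whole book; this is a swapped-in illustration, not a method proposed in the thread, and the function name is hypothetical.

```python
import math
from collections import Counter


def bigram_log_prob(ngram, tokens):
    """Estimate the log-probability of `ngram` (a tuple of words) under a
    bigram model with add-one smoothing, trained on `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    total = len(tokens)
    # Probability of the first word on its own.
    logp = math.log((unigrams[ngram[0]] + 1) / (total + vocab))
    # Chain rule: multiply in each word given the word before it.
    for prev, word in zip(ngram, ngram[1:]):
        logp += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return logp
```

A shared n-gram built from very common word sequences will score a relatively high log-probability and is less remarkable than one built from rare sequences, even when both have the same length.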
>
> Sorry I cannot be more specific. I'm not a statistician :-)
>
> Justin Washtell
> University of Leeds
>
>
> ________________________________________
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Yorick Wilks [Y.Wilks at dcs.shef.ac.uk]
> Sent: 17 April 2012 21:03
> To: Mark Davies
> Cc: corpora at uib.no
> Subject: Re: [Corpora-List] Comparing n-grams / authorship
>
> The questioner might want to look at the METER project: http://aclantho3.herokuapp.com/catalog/P02-1020
> This was an attempt to determine whether one text had been rewritten from another, based on n-grams, in a journalism and press service context (rather than plagiarism). It turned out that such texts could have very long n-grams in common without having been rewritten from each other.
> Yorick Wilks
>
>
> On 17 Apr 2012, at 15:47, Mark Davies wrote:
>
>> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
>>
>> Mark Davies, BYU
>>
>> -------------------------------------------
>>
>>
>> I am working with a 250,000 word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.
>>
>> The question is:
>>
>> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>
>

--
*Khalid Choukri*
ELRA General secretary & ELDA CEO
email: choukri at elda.org; Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30

** Info on LREC 2012 : www.lrec-conf.org **


