[Corpora-List] Comparing n-grams / authorship

Yorick Wilks Y.Wilks at dcs.shef.ac.uk
Tue Apr 17 22:03:01 CEST 2012

The questioner might want to look at the METER project: http://aclantho3.herokuapp.com/catalog/P02-1020 This was an attempt to determine, on the basis of shared n-grams, whether one text had been rewritten from another, in a journalism and press-service context (rather than plagiarism). It turned out that such texts could have very long n-grams in common without having been rewritten from each other.

Yorick Wilks
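
For readers who want to try the comparison described in the question quoted below, here is a minimal Python sketch. It is not the METER method and not an established authorship test. It only counts the n-grams that a pair of chapters shares and that occur in no other chapter, then asks how often other chapter pairs reach the same count. The function names, the whitespace tokenizer, and the four toy "chapters" are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase whitespace tokens; a real study needs a proper tokenizer.
    return text.lower().split()


def ngrams(tokens, n):
    # Set of distinct n-grams (as tuples) occurring in one token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_shared(chapters, i, j, n):
    # n-grams found in both chapter i and chapter j and in no other chapter.
    shared = ngrams(chapters[i], n) & ngrams(chapters[j], n)
    for k, other in enumerate(chapters):
        if k not in (i, j):
            shared -= ngrams(other, n)
    return shared


def pair_counts(chapters, n):
    # Count of exclusive shared n-grams for every pair of chapters.
    return {
        (i, j): len(exclusive_shared(chapters, i, j, n))
        for i, j in combinations(range(len(chapters)), 2)
    }


# Toy data, invented for this sketch only.
chapters = [tokenize(t) for t in [
    "the old king rode out at dawn to meet the envoy",
    "at dusk the old king rode out at dawn again",
    "the envoy waited by the river until the king came",
    "a river ran by the mill and the envoy waited there",
]]

counts = pair_counts(chapters, 4)
observed = counts[(0, 1)]
# Share of all chapter pairs with at least as many exclusive shared 4-grams.
as_extreme = sum(c >= observed for c in counts.values()) / len(counts)
print(observed, as_extreme)
```

The baseline over other chapter pairs is only a rough reference point: it ignores differences in chapter length, and, as the METER results suggest, even an unusually high count may reflect a common source or stock phrasing rather than a direct link between the two chapters.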

On 17 Apr 2012, at 15:47, Mark Davies wrote:

> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
> Mark Davies, BYU
> -------------------------------------------
> I am working with a 250,000 word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.
> The question is:
> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?
