[Corpora-List] Comparing n-grams / authorship

Alberto Barron Cedeño lbarron at dsic.upv.es
Wed Apr 18 19:04:44 CEST 2012

(I just realised my reply to this question didn't reach the list members because I added a png file as attachment. Please, when you read "the attached file", go to: http://nlp.dsic.upv.es/tmp/uniqueness_example.png I apologise in advance if at the end my message arrives twice)

Dear Mark,

From the numbers you mention ({6,7,8,9}-grams in common), it is very likely that the book chapters have a co-derivation relationship (either one of them was considered when producing the other, or both drew on a common source).
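
As a rough illustration of how such counts could be reproduced, here is a minimal Python sketch. It assumes whitespace tokenisation and lowercasing, and the names `word_ngrams` and `exclusive_shared_ngrams` are hypothetical, not taken from any existing tool.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_shared_ngrams(chapter_a, chapter_b, rest_of_book, sizes=(6, 7, 8, 9)):
    """For each n, count n-grams found in both chapters but not in the rest."""
    counts = {}
    for n in sizes:
        shared = word_ngrams(chapter_a, n) & word_ngrams(chapter_b, n)
        counts[n] = len(shared - word_ngrams(rest_of_book, n))
    return counts
```

Note that a shared 7-gram always contains two shared 6-grams, so counts obtained this way overlap across n; the figures in the original question may have been computed differently.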

You could both start from the point of view of forensic linguistics. Coulthard [1] considers that "the longer a phrase, the less likely you are going to find anybody use it". Experts estimate that (assuming circa 40% of the words in a text are lexical) documents on the same topic could share around 25% of their lexical words. But if two documents have circa 60% of their lexical words in common, they can be considered related [2]. Obviously, these figures refer to 1-grams. For higher-order n-grams the expected amount of shared terms is much lower.

This fact takes us to the concept of "uniqueness": every person is linguistically unique; no two people express their ideas in exactly the same way [3]. Inspired by some slides presented by M. Coulthard and M.T. Turell at PAN 2011 (see below), I tried a simple "uniqueness" experiment. I took a set of phrases and split them into n-grams of increasing order (0 < n < 14). Each resulting chunk was quoted and submitted to a commercial search engine. The results are in the attached file (don't worry about the different colours; consider all of them as randomly selected phrases): it is extremely unlikely that the same sequence of text (already from n=6 onwards) will occur in two presumably independent documents. You could try the same exercise with the fragments you mention.
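
A minimal Python sketch of the query-building step follows. It assumes that each chunk is the first n words of the phrase, which is only one possible reading of the experiment, and the function name `quoted_prefix_queries` is hypothetical. Submitting the queries to a search engine is deliberately left out.

```python
def quoted_prefix_queries(phrase, max_n=13):
    """Build exact-match queries from the first n words of phrase, n = 1..max_n.

    Assumption: the chunks grow from the start of the phrase; the
    experiment described in the message may have chunked differently.
    """
    words = phrase.split()
    upper = min(max_n, len(words))
    # Wrap each chunk in double quotes so a search engine treats it as a phrase.
    return ['"' + " ".join(words[:n]) + '"' for n in range(1, upper + 1)]
```

Each returned string can be pasted into a search engine by hand; the number of hits per n gives the kind of curve shown in the attached file.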

Now, what about two documents written by one single author? Table 1 in [4] shows a toy experiment we carried out on four documents written by the same authors: on average, only 3% of the 4-grams in two documents occurred in common (versus 16% of 1-grams and 11% of 2-grams). Note that we are talking about documents on the same topic, by the same authors.
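
A comparable figure for any pair of documents could be computed along these lines. This is a sketch only: it assumes whitespace tokenisation, lowercasing, and a Jaccard-style ratio (shared n-gram types over all n-gram types), which is not necessarily the measure behind Table 1 in [4]. The function names are hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(doc_a, doc_b, n):
    """Fraction of distinct n-grams that occur in both documents.

    Assumption: the denominator is the union of both n-gram sets;
    other normalisations (e.g. by the shorter document) are possible.
    """
    a, b = word_ngrams(doc_a, n), word_ngrams(doc_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Calling `shared_ngram_ratio(doc_a, doc_b, n)` for n = 1, 2, 4 on two texts shows how quickly the overlap drops as n grows.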

You or your colleague might be interested in the PAN Initiative (http://pan.webis.de), where automatic plagiarism detection and authorship identification tasks are included, among others. You can get an overview of the different models applied to these tasks from the previous editions of the lab (everything is available online). The Coulthard and Turell slides I mentioned before are available from the 2011 edition site (PAN @ CLEF'11), accessible from the same PAN website.

[1] Coulthard, M. (2004). 'Author Identification, Idiolect, and Linguistic Uniqueness'. Applied Linguistics 25: 431–447.
[2] Coulthard, M. (2010). The Linguist as Detective: Forensic Applications of Language Description. [http://bit.ly/madrid_lingforense], Madrid, Spain. Talk at: Jornadas (In)formativas de Lingüística Forense ((In)formative Conference on Forensic Linguistics).
[3] Coulthard, M. and Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. Routledge, Oxon, UK.
[4] Barrón-Cedeño, A. and Rosso, P. (2009). On Automatic Plagiarism Detection based on n-grams Comparison. In: Boughanem et al. (Eds.) ECIR 2009, LNCS 5478, pp. 696-700. Springer-Verlag, Berlin Heidelberg.

Kind regards,
Alberto

--
Alberto Barrón-Cedeño
Department of Information Systems and Computation (Ph.D. student)
Universidad Politécnica de Valencia
http://www.dsic.upv.es/~lbarron

On Tue, 2012-04-17 at 19:47 +0000, Mark Davies wrote:
> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
> Mark Davies, BYU
> -------------------------------------------
> I am working with a 250,000 word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared by chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.
> The question is:
> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked.