[Corpora-List] Comparing n-grams / authorship

Rui Sousa-Silva sousasilva.rui at gmail.com
Fri Apr 20 18:23:51 CEST 2012

Dear Mark,

As has already been suggested in this thread -- and as Alberto illustrated below -- authorship analysis has been researched in some depth, and successfully, by many forensic linguists worldwide. I thought that, in addition to the references already suggested, your colleague might be interested in some of the work we have done, such as:

[1] Sousa-Silva, R., Sarmento, L., Grant, T., Oliveira, E.C. & Maia, B. (2011) 'Comparing Sentence-Level Features for Authorship Analysis in Portuguese'. IN Proceedings of the Computational Processing of the Portuguese Language. (This paper presents results on authorship analysis of newspaper editorials, and might be of particular interest to your colleague - http://paginas.fe.up.pt/~niadr/PUBLICATIONS/2010/60010051.pdf)

[2] Sousa-Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E.C. & Maia, B. (2011) ''twazn me!!! ;(' Automatic Authorship Analysis of Micro-Blogging Messages'. IN R. Muñoz, A. Montoyo and E. Métais (Eds.). Lecture Notes in Computer Science 6716 Springer 2011 (Paper on authorship of micro-blogging messages - http://paginas.fe.up.pt/~niadr/PUBLICATIONS/2011/Twitter-NLDB2011.pdf)

[3] Grant, T. (2010) 'Txt 4n6: Idiolect free authorship analysis'. IN M. Coulthard and A. Johnson (Eds.) The Routledge Handbook of Forensic Linguistics. London: Routledge.

I hope these help!

Regards, Rui

On 18/04/2012, at 10:02, Alberto Barrón-Cedeño wrote:

> Dear Mark,
> From the numbers you mention ({6,7,8,9}-grams in common), it is very
> likely that the book chapters have a co-derivation relationship (either
> one of them was considered when producing the other or both considered a
> common source).
> You might first look at this from the point of view of forensic linguistics.
> [1] considers that "the longer a phrase, the less likely you are going
> to find anybody use it". Experts estimate that (assuming circa 40% of
> the words in a text are lexical) documents on the same topic could share
> around 25% of lexical words. But if two documents contain circa 60% of
> lexical words in common, they can be considered related [2]. Obviously
> in this case we are talking about 1-grams. For higher level n-grams the
> expected amount of shared terms is much lower.
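The 1-gram overlap measure described above (the proportion of lexical words two documents share) can be sketched roughly as follows. The tiny stopword list and the two sample texts are invented purely for illustration:

```python
# Sketch: proportion of lexical (content) 1-grams shared by two documents.
# The tiny stopword list and sample texts are invented for illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are"}

def lexical_words(text):
    """Lowercase tokens that are not function words."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def shared_lexical_ratio(text_a, text_b):
    """Fraction of the smaller lexical vocabulary shared by both texts."""
    a, b = lexical_words(text_a), lexical_words(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

doc1 = "the analysis of authorship relies on lexical choices in the text"
doc2 = "authorship analysis relies on the lexical choices an author makes"
print(round(shared_lexical_ratio(doc1, doc2), 2))  # two topically related toy texts
```

With a realistic stopword list and tokenizer, ratios around the 25% / 60% thresholds mentioned above would be the figures of interest.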
> This fact takes us to the concept of "uniqueness": every person is
> linguistically unique; no two people exist that express their ideas in
> the exact same way [3]. Inspired by some slides presented by M.
> Coulthard and M.T. Turell at PAN 2011 (see below), I tried a simple
> "uniqueness" experiment. I took a set of phrases and split them into
> n-grams of increasing order (0<n<14). Each resulting chunk was quoted and
> submitted to a commercial search engine. I attach the results (don't
> worry about the different colours; consider all of them randomly
> selected phrases): already from n=6, it is extremely unlikely that a
> given sequence of text will occur in two presumably independent
> documents. You could try the same exercise with the fragments you mention.
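The preparation step of that "uniqueness" experiment (splitting a phrase into n-grams of increasing order and building the quoted queries) can be sketched as below; the actual search-engine querying is omitted, and the sample phrase is invented:

```python
# Sketch of the "uniqueness" experiment: split a phrase into n-grams of
# increasing order and build the quoted queries that would be submitted
# to a search engine (the search step itself is omitted here).
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

phrase = "it is extremely unlikely that two long sequences of text will co-occur"
tokens = phrase.split()

for n in range(1, min(14, len(tokens) + 1)):
    queries = ['"%s"' % " ".join(g) for g in ngrams(tokens, n)]
    print(n, len(queries), queries[0])
```

Submitting each quoted query and recording the hit count per n would reproduce the curve described above: hit counts collapse rapidly as n grows.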
> Now, what about two documents written by one single author? Table 1 in
> [4] shows a toy experiment we carried out considering four documents
> written by the same authors: On average only 3% of the 4-grams in two
> documents occurred in common (versus 16% of 1-grams and 11% of 2-grams).
> Note we are talking about documents on the same topic, by the same
> authors.
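The shared-n-gram percentages from that toy experiment (3% of 4-grams versus 16% of 1-grams and 11% of 2-grams) correspond to a simple containment measure, which might be sketched as follows; the two short sample texts are invented for illustration:

```python
# Sketch of the shared-n-gram measure behind the toy experiment above:
# the percentage of one document's distinct n-grams that also occur in
# the other. The two sample texts are invented for illustration.
def ngram_set(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def shared_ngram_pct(text_a, text_b, n):
    a, b = ngram_set(text_a, n), ngram_set(text_b, n)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)

doc_a = "the two chapters share several long word sequences"
doc_b = "the two chapters appear to share some word sequences"
for n in (1, 2, 4):
    print(n, round(shared_ngram_pct(doc_a, doc_b, n), 1))
```

As in the experiment above, the shared percentage drops sharply as n increases, even for topically related texts.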
> You or your colleague might be interested in the PAN Initiative
> (http://pan.webis.de), where automatic plagiarism detection and
> authorship identification tasks are included, among others. You can get
> an overview of the different models applied to these tasks from the
> previous editions of the lab (everything is available online). The
> Coulthard and Turell slides I mentioned before are available from the
> 2011 edition site (PAN @ CLEF'11), accessible from the same PAN website.
> [1] Coulthard, Malcolm. ‘Author Identification, Idiolect, and Linguistic
> Uniqueness’. Applied Linguistics 25 (December 1, 2004): 431–447.
> [2] Coulthard, M. (2010). The Linguist as Detective: Forensic
> Applications of Language Description.
> [http://bit.ly/madrid_lingforense], Madrid, Spain. Talk at: Jornadas
> (In)formativas de Lingüística Forense ((In)formative Conference on
> Forensic Linguistics).
> [3] Coulthard, M. and Johnson, A. (2007). An Introduction to Forensic
> Linguistics: Language in Evidence. Routledge, Oxon, UK.
> [4] Barrón-Cedeño, A., Rosso, P. On Automatic Plagiarism Detection based
> on n-grams Comparison. In: Boughanem et al. (Eds.) ECIR 2009, LNCS 5478,
> pp. 696-700, Springer-Verlag Berlin Heidelberg (2009)
> Kind regards,
> Alberto
> --
> Alberto Barrón-Cedeño
> Department of Information Systems and Computation (Ph.D. student)
> Universidad Politécnica de Valencia
> http://www.dsic.upv.es/~lbarron
> On Tue, 2012-04-17 at 19:47 +0000, Mark Davies wrote:
>> I am sending the following question on behalf of a colleague at BYU. Thanks in advance for any suggestions you have; I'll forward them to the researcher who is working on this problem.
>> Mark Davies, BYU
>> -------------------------------------------
>> I am working with a 250,000 word text. Within this text there are two chapters, A and B (1,200 and 2,400 words respectively). The authorship of these two chapters is unknown, but we have reason to believe that the author(s) of A and B have a relationship that is different from the majority of the rest of the book. There are two 4-grams, three 6-grams, one 7-gram, one 8-gram, and one 9-gram shared in common in chapters A and B that appear nowhere else in the book. Intuitively it seems like there is a unique relationship between chapters A and B.
>> The question is:
>> Is there a statistical method of measuring whether the types of n-grams above establish a reasonable probability that the two texts are linked?
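One possible empirical approach to the question above is a permutation (Monte Carlo) test: how often do two randomly placed, non-overlapping spans of the same lengths as chapters A and B share at least as many long n-grams as A and B actually do? The sketch below uses an invented random "book" purely for illustration; it is one hedged possibility, not an established forensic procedure:

```python
import random

# Hedged sketch of a permutation test for shared n-grams between two
# chapters. The toy "book" and all parameters are invented for illustration.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_count(tok_a, tok_b, n):
    return len(ngram_set(tok_a, n) & ngram_set(tok_b, n))

def permutation_p(book, len_a, len_b, n, observed, trials=1000, seed=0):
    """Estimated probability that two random non-overlapping spans share
    at least `observed` distinct n-grams of order n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        while True:  # resample until the two spans are disjoint
            i = rng.randrange(len(book) - len_a + 1)
            j = rng.randrange(len(book) - len_b + 1)
            if i + len_a <= j or j + len_b <= i:
                break
        if shared_count(book[i:i + len_a], book[j:j + len_b], n) >= observed:
            hits += 1
    return hits / trials

# Toy book: 2,000 tokens drawn from a 10-symbol vocabulary.
rng = random.Random(42)
book = [rng.choice("abcdefghij") for _ in range(2000)]
p = permutation_p(book, len_a=120, len_b=240, n=6, observed=3)
print(p)  # a small value means chance alone rarely yields 3 shared 6-grams
```

On real data one would use the actual book's tokens and the true chapter lengths; a small estimated p would support the intuition that the sharing between A and B is not accidental.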
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> [Attachment: uniqueness_example.png]

