[Corpora-List] ANC Bigrams and Trigrams

Nicolas Hernandez nicolas.hernandez at gmail.com
Mon Feb 14 14:30:00 CET 2005


On Fri, 11 Feb 2005 14:42:18 -0500, Nancy Ide <ide at cs.vassar.edu> wrote:

> We are generating bigram and trigram data from the ANC First Release,
> which will very soon be available on the (new and improved) ANC
> website. We have a question for those who might be interested in this
> kind of data: is it useful to generate the data for word pairs/triples
> that span sentence (or even paragraph) boundaries? Is there any
> advantage if we provide two sets of the bigram and trigram data, one
> that spans such boundaries and one that doesn't?


Dear Nancy,

Personally, I have used n-grams to extract "meta-discourse expressions"
(basically, frequent n-grams occurring in a corpus of a specific
genre). I was interested in punctuation marks because they could give
me contextual cues that could be used to select these expressions.
For example:
"in this section" could have a different discourse interpretation at
the start (". In this section") and at the end of a sentence ("in this
section .") (depending on the text genre).

In my opinion, having such n-grams (ones that include the boundary
punctuation) makes statistical measures more accurate.
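
To make the distinction concrete, here is a minimal sketch in Python of
counting n-grams either across sentence boundaries or within sentences
only. It is not the ANC procedure: the tokenizer, the sentence splitter
and all function names (tokenize, split_sentences, count_ngrams) are
assumptions made for illustration, and punctuation marks are simply kept
as ordinary tokens.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude, assumed tokenizer: lowercase words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def split_sentences(tokens, enders=(".", "!", "?")):
    # Naive splitter: a sentence ends at ".", "!" or "?" (kept as its last token).
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in enders:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def ngrams(tokens, n):
    # Sliding window of n consecutive tokens, yielded as tuples.
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(text, n, span_boundaries=True):
    tokens = tokenize(text)
    if span_boundaries:
        # One pass over the whole token stream: n-grams may cross sentence ends.
        return Counter(ngrams(tokens, n))
    counts = Counter()
    for sentence in split_sentences(tokens):
        # Count inside each sentence only.
        counts.update(ngrams(sentence, n))
    return counts

text = ("We describe the data. In this section we give results. "
        "Details are in this section.")
spanning = count_ngrams(text, 4, span_boundaries=True)
bounded = count_ngrams(text, 4, span_boundaries=False)
```

With this toy text, the sentence-initial pattern (".", "in", "this",
"section") is only observed in the spanning counts, while the
sentence-final pattern ("in", "this", "section", ".") is observed in
both sets.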

/Nicolas


>
> Thanks,
> Nancy Ide
>
> =======================================================
>
> Nancy Ide
>
> Professor of Computer Science
> Vassar College
> Poughkeepsie, NY 12604-0520 USA
> Tel: +1 845 437-5988 Fax: +1 845 437-7498
> ide at cs.vassar.edu
>
> Chercheur Associe
> Equipe Langue et Dialogue, LORIA/CNRS
> Campus Scientifique - BP 239
> 54506 Vandoeuvre-les-Nancy FRANCE
> Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
> ide at loria.fr
>
> =======================================================


--
Nicolas Hernandez
LIR - LIMSI
BP 133, 91403 Orsay Cedex
tel. 01 69 85 80 03, fax 01 69 85 80 88
IIE - CNAM
tel. 01 69 36 73 48




