[Corpora-List] question as to MI and t score

Serge HEIDEN slh at ens-lsh.fr
Tue Dec 20 13:14:00 CET 2005

Sorry for coming late to a thread that has now cooled down.

I agree with Stefan that being able to put a number on
the strength of a collocation does not give you much
insight into what a collocation is, or at least into
what it may be useful for.
May I suggest broadening the model you are working with a bit:
- in the things observed and counted: your collocation
candidate words seem to be plain lexical items. You know
that word frequencies vary a lot with the linguistic role words
play in texts. It may be useful to place each candidate word
on a continuum from, say, grammatical items to lexical items.
Even if all the words are on the lexical side, they may occupy
different, interesting positions on this virtual axis, and this
can give you information to better interpret your model.
Alas, today I don't know of any formal model that takes
that kind of information into account. Any ideas?
- in the context in which they are counted as being together: you give
us no information on the type of texts involved. This, too,
can drastically change the frequencies and their interpretation.
Maybe it would be useful to place the contexts actually observed
on a continuum from, say, a media/genre/style/register type
of context taken as a whole to a phrase/syntagm type.
Varying the size of the context may be another way to focus
on a specific interpretation of a collocation model. You
can make words meet in windows of varying size that move
through the texts, or build contexts from typographic
heuristics like "hard" punctuation ('.', '!', etc.).
For example, in textometric tools, we generally start with small
contexts (window sizes or word-based n-grams)
to analyze candidate syntagms. After this, we can reconsider
what were initially two candidate words as one candidate
compound word in a more 'discourse'-oriented cooccurrence analysis.
- in the way you define being together: in the French
scientific literature, we call collocates things that are
somewhat close to each other on the syntagmatic axis, and
cooccurrents things that occur together without anything
being known about their proximity. You could also take into
account the orientation of the meetings: X occurring before Y
being counted, or taken into account, differently from X
occurring after Y on the syntagmatic axis.
- finally, in the way different pairs could be compared:
and this takes us back to your initial question, how to compare
two pairs? May I suggest trying to compare ALL
pairs together at the same time? This way of doing
things is what some optimistic people call 'semantic maps'.
Today, I have only some very pragmatic proposals
to offer in that area (see
http://weblex.ens-lsh.fr/biblio/slh/SergeHeidenCooccurrencesJADT2004Final.htm
as an introductory example. Sorry, it is in French). There
are so many different parameters involved in building a specific
cooccurrence graph that we try to analyze all of it before
changing a single parameter.
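
To make the context, window and orientation parameters above a bit more
concrete, here is a minimal Python sketch. It is only an illustration, not
the procedure of any particular textometric tool: the function names, the
set of "hard" punctuation marks, the window size and the toy text are all
assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

# Assumed set of "hard" punctuation marks; adjust to the corpus at hand.
HARD_PUNCTUATION = {".", "!", "?"}


def split_contexts(tokens):
    """Cut a token list into contexts at hard punctuation marks."""
    contexts, current = [], []
    for tok in tokens:
        if tok in HARD_PUNCTUATION:
            if current:
                contexts.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        contexts.append(current)
    return contexts


def directed_window_counts(tokens, window):
    """Count ordered pairs (x, y) where y follows x by at most `window` tokens."""
    counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            counts[(x, y)] += 1
    return counts


def cooccurrence_counts(contexts):
    """Count unordered pairs of distinct words sharing a context, ignoring distance."""
    counts = Counter()
    for ctx in contexts:
        for x, y in combinations(sorted(set(ctx)), 2):
            counts[(x, y)] += 1
    return counts


# Toy text, already tokenized (hypothetical data).
tokens = "they play a role . they fight a battle ! a role to play".split()
contexts = split_contexts(tokens)

# "Collocates": oriented pairs met inside a small moving window.
collocates = Counter()
for ctx in contexts:
    collocates.update(directed_window_counts(ctx, window=2))

# "Cooccurrents": pairs met in the same context, proximity unknown.
cooccurrents = cooccurrence_counts(contexts)

print(collocates[("play", "role")], collocates[("role", "play")])
print(cooccurrents[("play", "role")])
```

With the oriented window counts, ("play", "role") and ("role", "play")
are kept apart, whereas the context-based counts merge them into one
unordered pair. Changing `window` or the punctuation set changes the
counts, which is exactly why these parameters should be fixed and
reported before any association score is compared.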



Stefan Evert wrote:

>> Working out exactly what
>> upper bounds on this difference one can assume with how much
>> confidence is almost as difficult as a mathematical problem as
>> interpreting the differences is as a linguistic problem (what does
>> it really mean if the difference in collocational strength is at
>> most "1.7"??).

>>> Imagine you have called up collocation listings for the node word
>>> lemmas "play" and "fight". In both lists, the association with for
>>> example the collocates "role" and "battle" has the exactly the same
>>> MI / t score. Can I assume that both collocations, i.e. "play a
>>> role" and "fight a battle" have the same "collocational strength",
>>> or is that a wrong assumption?

>>> Thanks,
>>> Helene

Serge Heiden, slh at ens-lsh.fr, https://weblex.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)632010638

