[Corpora-List] how developing a lexicon

Dom Widdows widdows
Tue Apr 7 17:13:25 CEST 2009

Dear Farhad,

I think the first question that would leap to mind for many people on this list is: which language pairs are you interested in, and what parallel or comparable corpora do you have available?

If you have parallel corpora, there are many methods for extracting term pairs, and implementations are readily available nowadays (e.g., we have some support for this in SemanticVectors; see http://code.google.com/p/semanticvectors/wiki/BilingualModels).
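
To give a rough idea of what extracting term pairs from a parallel corpus can look like, here is a minimal sketch in Python. It is not the SemanticVectors approach; it swaps in a much simpler technique, scoring word pairs by the Dice coefficient of their co-occurrence in aligned sentence pairs. The function names (dice_term_pairs, best_translations) and the toy English/Spanish corpus are hypothetical and only for illustration, and whitespace tokenization is an assumption that would not hold for many languages.

```python
from collections import Counter
from itertools import product


def dice_term_pairs(aligned_pairs, min_count=2):
    """Score source/target word pairs by the Dice coefficient of their
    sentence-level co-occurrence in an aligned corpus."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        # Use sets so each word counts once per sentence pair.
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    scores = {}
    for (s, t), joint in pair_freq.items():
        # Ignore pairs seen together fewer than min_count times.
        if joint >= min_count:
            scores[(s, t)] = 2 * joint / (src_freq[s] + tgt_freq[t])
    return scores


def best_translations(scores):
    """Keep the highest-scoring target word for each source word."""
    best = {}
    for (s, t), score in scores.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Hypothetical toy corpus of aligned sentence pairs.
toy_corpus = [
    ("red house", "casa roja"),
    ("big house", "casa grande"),
    ("red car", "coche rojo"),
    ("big car", "coche grande"),
]

for source, (target, score) in sorted(best_translations(dice_term_pairs(toy_corpus)).items()):
    print(source, target, round(score, 2))
```

On real data the candidate list would still need manual checking, and function words tend to produce spurious high scores unless they are filtered out.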

In general, over recent decades many approaches have shifted from asking questions like "give me a list of all English verbs and their conjugations" to asking questions like "given a sample of the data you're interested in working with, give me a list of prevalent English verbs and their conjugations".
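
The corpus-driven version of the question can be sketched roughly as follows. This is only an illustration under strong assumptions: it strips a handful of regular English suffixes, trusts a base form only if it is itself attested in the sample, and does no part-of-speech tagging, so nouns (dog/dogs) show up alongside verbs and irregular forms (went, was) are missed entirely. The names SUFFIXES, candidate_bases and prevalent_paradigms are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Naive English inflection suffixes; irregular verbs are not handled.
SUFFIXES = ("ing", "ed", "es", "s")


def candidate_bases(token):
    """Yield possible base forms for a token by stripping one suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            stem = token[: -len(suffix)]
            yield stem            # walked -> walk
            yield stem + "e"      # making -> make
            if len(stem) > 2 and stem[-1] == stem[-2]:
                yield stem[:-1]   # running -> run


def prevalent_paradigms(text, min_total=2):
    """Group attested word forms under a base form and keep the groups
    whose combined frequency in the sample is at least min_total."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    paradigms = defaultdict(Counter)
    for token, n in counts.items():
        for base in candidate_bases(token):
            if base in counts:  # only trust bases attested in the sample
                paradigms[base][token] = n
                break
    result = {}
    for base, forms in paradigms.items():
        forms[base] = counts[base]
        if sum(forms.values()) >= min_total:
            result[base] = dict(forms)
    return result


# Hypothetical sample standing in for "the data you're interested in".
sample = (
    "She walks to work. They walk home. He walked and is walking. "
    "We make tea; she makes tea; making tea is fun. "
    "The dog runs; dogs run; running helps."
)

for base, forms in sorted(prevalent_paradigms(sample).items()):
    print(base, forms)
```

Restricting the output to verbs would need a tagger or a seed list, and the resulting paradigms would still call for manual checking.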

Best wishes, Dominic

On Sun, Apr 5, 2009 at 7:17 AM, Eros Zanchetta <eros.zanchetta at gmail.com> wrote:
> Dear Farhad,
> if you're building a lexicon from scratch, you might be interested in
> the paper:
> Eros Zanchetta and Marco Baroni (2005) Morph-it! A free corpus-based
> morphological resource for the Italian language, proceedings of Corpus
> Linguistics 2005, University of Birmingham, Birmingham, UK
> (http://sslmit.unibo.it/~eros/downloads/Morph-it.pdf).
> It describes a method for the rapid creation of a lexicon using a
> mixture of corpus based techniques and manual checking. We created an
> Italian lexicon, but the method may be applied to other languages too.
> Best,
> Eros Zanchetta
> Farhad Atghiaee wrote:
>> dear members
>> i am now developing a lexicon for a Machine Translation system.
>> if anybody knows a helpful source or data i would appreciate it.
>> as an example, suppose we want to gather all English verbs and their
>> conjugations, is there a resource for it?
>> regards
