Best regards,
Dan Tufis

Quoting chris brew <cbrew at acm.org>:
> My point was not that MULTEXT-EAST already meets Mike's needs exactly,
> but that it is a good basis for an extensible, cross-linguistically
> useful tagset that does do this.
>
> The PDT 2.0 documentation, chapters 3 and 4, has a detailed discussion
> of names and abbreviations, including the names of horses, DJs and Julie
> Sedivy (whose name is, they say, of Czech origin, but adjusted/smoothed
> over at some point to fit in with non-Czech expectations).
>
> See http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/
>
> There is much more.
> Chapter 13.7 documents how to tag chess moves, should such occur in the
> data.
> Much of this is specific to Czech and Czech newspaper norms. It is the
> general approach (non-atomic tags, excellent documentation, care for the
> realities of the data) that I advocate. It is not a ready-made solution,
> but a good basis for one.
>
> On 06/02/2008, Eric Atwell <eric at comp.leeds.ac.uk> wrote:
>>
>> Correct me if I'm wrong, but I thought EAGLES, MULTEXT-EAST etc. tagsets
>> don't make the kind of distinctions Mike alluded to, e.g. distinguishing
>> 3 from three, foreign-names from local-names, and categorising
>> non-dictionary words with something more than "unknown". This isn't
>> PoS-tagging in the traditional sense, which EAGLES etc. extended from
>> English to other languages.
>>
>> Eric Atwell Leeds University
>>
>>
>>
>> On Wed, 6 Feb 2008, Serge Sharoff wrote:
>>
>> > My vote goes to MULTEXT-EAST (MTE). For its next version it has been
>> > adapted to include Persian, Finnish and Hungarian in addition to Slavonic
>> > languages in Version 3, so it's quite flexible. However, MTE might be
>> > overkill for your purposes, as the tagset for Russian has more than 600 tags
>> > (if you take into account all combinations of cases, numbers, genders,
>> > tenses, etc.), but the English set is much smaller.
>> > S
>> >
>> > -----Original Message-----
>> > From: corpora-bounces at uib.no on behalf of chris brew
>> > Sent: Tue 05/02/2008 22:50
>> > To: maxwell at umiacs.umd.edu
>> > Cc: corpora
>> > Subject: Re: [Corpora-List] Tag sets
>> >
>> > You could start from something that has already been applied
>> > multilingually, such as the MULTEXT-EAST materials at
>> > http://nl.ijs.si/ME/V3/msd/html/
>> >
>> > or
>> >
>> > http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/index.html
>> >
>> > which is a similar thing developed by the Prague group for Czech.
>> > Anna Feldman, Jiri Hana (who co-wrote the PDT manual above) and I
>> > have some experience in using adapted versions for Russian, Polish,
>> > Spanish and Catalan. It would be fun to find out if the same ideas
>> > work for Bengali etc.
>> >
>> > Chris
>> >
>> >
>> > On 05/02/2008, maxwell at umiacs.umd.edu <maxwell at umiacs.umd.edu> wrote:
>> >>
>> >> We're looking at annotating a small sample (~5k words) of Bengali text,
>> >> and later maybe Urdu and Punjabi. The annotation will be the dictionary
>> >> citation form of each word. The texts are mostly news articles, so there
>> >> are a fair number of words for which there won't be any dictionary
>> >> citation form. These include many proper names, numerals, acronyms, and
>> >> who knows what else. I'll refer to these as "non-dictionary words",
>> >> whereas "dictionary words" will include words whose citation form is in
>> >> the dictionary we're using, even if the inflected wordform itself is not.
>> >> (We're doing this to test a morphological parser.)
>> >>
>> >> This is not quite the same as the inverse of named entity tagging, since
>> >> some parts of names may have citation forms. For example, in English one
>> >> would tag "Mississippi River" as a name. But "River" would be found in
>> >> the dictionary, so for our purposes we would only want to tag
>> >> "Mississippi" as a non-dictionary word.
>> >>
>> >> The simplest thing for us to do would be to just tag all such
>> >> non-dictionary words the same way, e.g. with a tag "NOT". However, in the
>> >> interest of future uses to which we might put such a tagged text, it might
>> >> be good to differentiate among the various kinds of non-dictionary words.
>> >>
>> >> We could easily make up our own tagset for non-dictionary words, but it
>> >> strikes me that better would be to use some standard tagset for such
>> >> words, if such a tagset exists. There is a table of tagsets in Manning
>> >> and Schutze pp. 141-2, including the Penn Treebank, Brown, and CLAWS.
>> >> However, the tagsets are English-specific. This is especially noticeable
>> >> in the punctuation tags for the PTB and Brown sets, but also e.g. in the
>> >> decision to tag singular and plural proper nouns differently. (Some
>> >> languages attach case markers to proper nouns.) Also, it appears that
>> >> none of the tagsets distinguishes between numerals ('3', '4.5') and
>> >> numbers written out ('three', 'four point five'), which we need to do, nor
>> >> are acronyms distinguished from "symbols".
>> >>
>> >> Another distinction I thought about making is between "ordinary" Bengali
>> >> names, and foreign names, since one might later want to develop a
>> >> transducer to convert the latter into their more common Latin forms.
>> >> However, I suspect that might be too difficult a distinction for
>> >> annotators to make, and in any case some well-known Bengali names are
>> >> likely to have "standard" transliterations.
>> >>
>> >> Does anyone know of a semi-standard tagset that would be less
>> >> English-specific, and would make the kinds of distinctions among
>> >> non-dictionary words that we want to (or should) make? Or should we just
>> >> make up our own set?
>> >>
>> >> Mike Maxwell
>> >> CASL/ U MD
>> >>
>> >>
>> >> _______________________________________________
>> >> Corpora mailing list
>> >> Corpora at uib.no
>> >> http://mailman.uib.no/listinfo/corpora
>> >>
>> >
>>
>> --
>> Eric Atwell,
>> Senior Lecturer, Language research group leader, School of Computing,
>> Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
>> TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell
>>
>