-----Original Message-----
From: corpora-bounces at uib.no on behalf of chris brew
Sent: Tue 05/02/2008 22:50
To: maxwell at umiacs.umd.edu
Cc: corpora
Subject: Re: [Corpora-List] Tag sets
You could start from something that has already been applied multilingually, such as
the MULTEXT-EAST materials at http://nl.ijs.si/ME/V3/msd/html/
which is a similar thing developed by the Prague group for Czech. Anna Feldman, Jiri Hana (who co-wrote the pdt manual above) and I have some experience in using adapted versions for Russian, Polish, Spanish and Catalan. It would be fun to find out if the same ideas work for Bengali etc.
On 05/02/2008, maxwell at umiacs.umd.edu <maxwell at umiacs.umd.edu> wrote:
> We're looking at annotating a small sample (~5k words) of Bengali text,
> and later maybe Urdu and Punjabi. The annotation will be the dictionary
> citation form of each word. The texts are mostly news articles, so there
> are a fair number of words for which there won't be any dictionary
> citation form. These include many proper names, numerals, acronyms, and
> who knows what else. I'll refer to these as "non-dictionary words",
> whereas "dictionary words" will include words whose citation form is in
> the dictionary we're using, even if the inflected wordform itself is not.
> (We're doing this to test a morphological parser.)
> This is not quite the same as the inverse of named entity tagging, since
> some parts of names may have citation forms. For example, in English one
> would tag "Mississippi River" as a name. But "River" would be found in
> the dictionary, so for our purposes we would only want to tag
> "Mississippi" as a non-dictionary word.
> The simplest thing for us to do would be to just tag all such
> non-dictionary words the same way, e.g. with a tag "NOT". However, in the
> interest of future uses to which we might put such a tagged text, it might
> be good to differentiate among the various kinds of non-dictionary words.
> We could easily make up our own tagset for non-dictionary words, but it
> strikes me that better would be to use some standard tagset for such
> words, if such a tagset exists. There is a table of tagsets in Manning
> and Schütze, pp. 141-2, including the Penn Treebank, Brown, and CLAWS.
> However, the tagsets are English-specific. This is especially noticeable
> in the punctuation tags for the PTB and Brown sets, but also e.g. in the
> decision to tag singular and plural proper nouns differently. (Some
> languages attach case markers to proper nouns.) Also, it appears that
> none of the tagsets distinguishes between numerals ('3', '4.5') and
> numbers written out ('three', 'four point five'), which we need to do, nor
> are acronyms distinguished from "symbols".
> Another distinction I thought about making is between "ordinary" Bengali
> names, and foreign names, since one might later want to develop a
> transducer to convert the latter into their more common Latin forms.
> However, I suspect that might be too difficult a distinction for
> annotators to make, and in any case some well-known Bengali names are
> likely to have "standard" transliterations.
> Does anyone know of a semi-standard tagset that would be less
> English-specific, and would make the kinds of distinctions among
> non-dictionary words that we want to (or should) make? Or should we just
> make up our own set?
> Mike Maxwell
> CASL/ U MD
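
For illustration only, here is a minimal sketch of the annotation scheme described in the quoted message: each token receives either a dictionary citation form or a non-dictionary tag. Everything named here is an assumption, not something the thread settles on: the tag names NUM and ACR, the toy English lexicon standing in for the morphological parser plus dictionary, and the two regular expressions. Only the catch-all tag "NOT" comes from the message. The capital-letter acronym test in particular would not carry over to Bengali, which has no letter case.

```python
import re

# Hypothetical tag names; only "NOT" is taken from the quoted message.
NUMERAL = "NUM"   # digit strings such as '3' or '4.5'
ACRONYM = "ACR"   # Latin-script acronyms such as 'NATO' or 'U.S.'
OTHER = "NOT"     # catch-all for any other non-dictionary word

# Toy stand-in for the morphological parser plus dictionary lookup:
# maps a lowercased wordform to its citation form.
TOY_LEXICON = {"river": "river", "rivers": "river", "three": "three"}

# Assumed patterns, adequate for the English examples only.
NUMERAL_RE = re.compile(r"\d+(?:[.,]\d+)*")
ACRONYM_RE = re.compile(r"(?:[A-Z]\.?){2,}")


def annotate(token, lexicon=TOY_LEXICON):
    """Return (token, citation_form, tag).

    Dictionary words get a citation form and no tag; non-dictionary
    words get a tag and no citation form.
    """
    citation = lexicon.get(token.lower())
    if citation is not None:
        return (token, citation, None)
    if NUMERAL_RE.fullmatch(token):
        return (token, None, NUMERAL)
    if ACRONYM_RE.fullmatch(token):
        return (token, None, ACRONYM)
    return (token, None, OTHER)


if __name__ == "__main__":
    for tok in "Mississippi River 4.5 three NATO".split():
        print(annotate(tok))
```

With this toy lexicon, "River" is annotated with the citation form "river" while "Mississippi" falls through to "NOT", matching the "Mississippi River" example in the message. The spelled-out "three" is treated as a dictionary word, so it stays distinct from the numeral "4.5".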