[Corpora-List] Part of Speech annotation of Persian and Urdu corpora

Ben Allison B.Allison at dcs.shef.ac.uk
Wed Feb 27 12:44:36 CET 2008


Bushra,

I'm not sure whether you want human-annotated text from which to induce a tagger, or a working POS tagger itself. If the latter: about a year ago we tracked down a 10-million-word corpus of Persian which had been hand-annotated, and induced a tagger from the 1-million-word portion that the creators were prepared to give away for research purposes. The tagset they used (which they created for the job) can be interpreted on two levels: a coarse tagset of 14 tags with categories like Noun, Verb, etc., and a much finer one which I believe ran to about 150 tags. Accuracies were pretty good: over 98% for the coarse tags and around 92% for the fine ones.

I'm not sure if you're prepared for a DIY approach, but I suspect that if you are, you could get hold of the corpus we used (I can pass you contact information) and use one of many trainable taggers to induce your own. Of course, this might not be what you were thinking of...
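
To give a rough idea of what the DIY route involves, here is a minimal sketch in Python of inducing a tagger from hand-annotated text and scoring it at both tag levels. It is not the tagger we built: it is only a most-frequent-tag baseline, and the transliterated words, the tag names (N_SING, V_PRS, etc.) and the rule that the coarse tag is the part before the underscore are all invented for illustration, not taken from the real tagset.

```python
from collections import Counter, defaultdict

# Toy hand-annotated sentences as (word, fine_tag) pairs.
# Words and tag names are invented placeholders, not the real corpus or tagset.
TRAIN = [
    [("ketab", "N_SING"), ("khub", "ADJ_SIM"), ("ast", "V_PRS")],
    [("ketabha", "N_PL"), ("khub", "ADJ_SIM"), ("hastand", "V_PRS")],
    [("u", "PRO"), ("ketab", "N_SING"), ("darad", "V_PRS")],
]


def coarse(fine_tag):
    """Collapse a fine tag to a coarse one (assumed: text before the first '_')."""
    return fine_tag.split("_", 1)[0]


def train_unigram_tagger(sentences):
    """Learn each word's most frequent tag, plus a default tag for unseen words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in per_word.items()}
    default = overall.most_common(1)[0][0]
    return model, default


def tag(words, model, default):
    """Tag a list of words, falling back to the default tag for unknown words."""
    return [(word, model.get(word, default)) for word in words]


def accuracy(gold_sentences, model, default, mapping=lambda t: t):
    """Per-token accuracy; pass mapping=coarse to score at the coarse level."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = tag([word for word, _ in sentence], model, default)
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += mapping(gold) == mapping(guess)
            total += 1
    return correct / total


if __name__ == "__main__":
    model, default = train_unigram_tagger(TRAIN)
    held_out = [[("u", "PRO"), ("ketabha", "N_PL"), ("darad", "V_PRS")]]
    print("fine:  ", accuracy(held_out, model, default))
    print("coarse:", accuracy(held_out, model, default, mapping=coarse))
```

Any of the established trainable taggers will do far better than this baseline, since they also use context and word shape; the point is only that the same annotated data can be scored against either the fine or the coarse tagset.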

Ben

hfaili at ece.ut.ac.ir wrote:
> Dear Bushra,
> I work at an Iranian company (Douran, www.douran.com) which
> has good experience and tools for POS tagging and other NLP fields
> in Persian...
> For more information, contact me at hfaili at douran.com
> Regards
>
> hello
> I was wondering if anybody knows of any companies or individual linguists
> who would do Part of Speech annotation of Persian and Urdu corpora?
>
> Thank you
> Bushra Zawaydeh
>
> ********************************************************************
> Bushra Zawaydeh bushraz at basistech.com
> Senior Linguist
> Basis Technology Tel: (617)386-7130
> One Alewife Center Fax: (617)386-2020
> Cambridge, MA 02140-2327
> USA
> **********************************************************************
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


