[Corpora-List] Free morphological analyser for Polish

Adam Radziszewski kocikikut at gmail.com
Wed Apr 6 16:06:33 CEST 2011


Dear corpora members,

we've released an open morphological analyser for Polish. The analyser consists of two parts:

• the morphological dictionary, obtained by tagset conversion of Morfologik 1.7 (morfologik.blogspot.com), licensed under Creative Commons ShareAlike or GNU LGPL (the user is free to choose),
• a configurable morphological analysis and tokenisation framework called Maca (GNU GPL; bundled with ready-to-use configurations for Polish and the above dictionary compiled as a transducer).

The analyser can produce output in the tagset of the IPI PAN Corpus. This is important, since MSD taggers for Polish (at least TaKIPI and Pantera) resort to external analysers when tagging plain text, and to the best of our knowledge there is no other free combination of a training corpus and an analyser that operate on the same tagset.

Dictionary “source” and its description: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted

The MACA system: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/

The relevant fragment of the IPI PAN Corpus (the training corpus referred to above) is available at: http://korpus.pl/index.php?lang=en&page=download

It's also worth noting that the MACA suite contains a tokeniser (“toki”) that is probably the first open-source C++ implementation of SRX segmentation rules. Both toki and maca proper may be used as shared libraries or through their simple command-line utilities (tested only under GNU/Linux).
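
For readers unfamiliar with SRX, here is a minimal sketch of the general idea in Python: an ordered list of rules, each with a "before break" and an "after break" pattern, where the first rule matching at a position decides whether to split there. This is not toki's code or configuration; the rule set, the abbreviation list and the function name split_sentences are assumptions made up for this illustration, and real SRX rules live in XML files with language mappings.

```python
import re

# Illustrative rules only (assumed for this sketch, not taken from toki):
# (is_break, before_break_regex, after_break_regex), tried in order.
RULES = [
    (False, r"\b(?:np|tzn|prof)\.", r"\s"),  # exception: abbreviations do not end a sentence
    (True, r"[.?!]+", r"\s"),                # break after sentence-final punctuation
]

def split_sentences(text, rules=RULES):
    # Anchor each "before" pattern at the end of the text preceding the position.
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    sentences, start = [], 0
    for pos in range(1, len(text)):
        prefix = text[:pos]
        for is_break, before, after in compiled:
            if before.search(prefix) and after.match(text, pos):
                if is_break:
                    sentences.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    for s in split_sentences("To jest prof. Kowalski. On pisze np. tak. Koniec!"):
        print(s)
```

Putting the exception rule first is what keeps "prof." and "np." from splitting a sentence; with the order reversed, the general break rule would always win.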

Best regards,
Adam


