The analyser can produce output in the tagset of the IPI PAN Corpus. This is important, since MSD taggers for Polish (at least TaKIPI and Pantera) resort to external analysers when tagging plain text, and to the best of our knowledge there is no other free combination of a training corpus and an analyser that operate on the same tagset.
Dictionary “source” and its description: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted
The MACA system: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/
The mentioned fragment of the IPI PAN corpus is available at: http://korpus.pl/index.php?lang=en&page=download
It's also worth noting that the MACA suite contains a tokeniser (“toki”) that is probably the first open-source C++ implementation of SRX segmentation rules. Both toki and maca proper may be used as shared libraries or through their simple command-line utilities (tested only under GNU/Linux).
Best regards,
Adam