[Corpora-List] Software tool and library for efficient n-gram, skipgram extraction and corpus analysis

Maarten van Gompel proycon at anaproy.nl
Mon Sep 21 15:09:40 CEST 2015

I would like to announce release v1.0 of Colibri Core, software for working with basic linguistic constructions such as n-grams and skipgrams in a fast, memory-efficient, yet lossless way that is suitable for big data.

See https://proycon.github.io/colibri-core

Colibri Core enables you to:

* extract patterns and their frequency from corpora

* preserve the exact indices where patterns occur in the corpus, allowing reverse-lookup as well

* model various relationships between patterns (subsumption, succession, abstraction, co-occurrence)

* compare patterns between different corpora (using coverage metrics and/or log-likelihood)
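
To make the first two points concrete, here is a small plain-Python sketch of what indexed n-gram and skipgram extraction means. It is not Colibri Core's API and says nothing about how Colibri Core is implemented. The function names, the "{*}" gap marker and the toy sentence are assumptions made only for illustration, and the sketch makes no attempt at memory efficiency.

```python
from collections import defaultdict

GAP = "{*}"  # hypothetical gap marker, chosen only for this sketch


def extract_ngrams(tokens, max_n=2):
    """Map every n-gram (a tuple of tokens) to the token offsets where it starts."""
    index = defaultdict(list)
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            index[tuple(tokens[start:start + n])].append(start)
    return index


def extract_skipgrams(tokens, n=3):
    """Map every skipgram to its start offsets.

    Here a skipgram is an n-gram in which exactly one interior token
    is replaced by a gap, so n must be at least 3.
    """
    index = defaultdict(list)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        for i in range(1, n - 1):  # interior positions only
            pattern = tuple(window[:i] + [GAP] + window[i + 1:])
            index[pattern].append(start)
    return index


def reverse_index(index):
    """Invert a pattern -> offsets mapping into offset -> patterns starting there."""
    by_position = defaultdict(list)
    for pattern, positions in index.items():
        for position in positions:
            by_position[position].append(pattern)
    return by_position


tokens = "the big cat saw the small cat".split()
ngrams = extract_ngrams(tokens, max_n=2)
skipgrams = extract_skipgrams(tokens, n=3)
positions = reverse_index(ngrams)

# Frequency is simply the number of recorded offsets.
print(len(ngrams[("cat",)]), ngrams[("cat",)])
print(skipgrams[("the", GAP, "cat")])
print(positions[4])
```

Because the offsets are stored rather than only the counts, nothing is lost: the frequency of a pattern is the length of its offset list, and the reverse lookup recovers which patterns begin at a given position in the corpus.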

The software is open-source (GPL) and consists of command-line tools and a C++ programming library with Python bindings. It aims to lay a foundation on which more specialised or end-user-oriented software can be built.



Maarten van Gompel

Centre for Language Studies

Radboud Universiteit Nijmegen

proycon at anaproy.nl http://proycon.anaproy.nl http://github.com/proycon

GnuPG key: 0x1A31555C XMPP: proycon at anaproy.nl Telegram: proycon IRC: proycon (freenode) Twitter: https://twitter.com/proycon Bitcoin: 1BRptZsKQtqRGSZ5qKbX2azbfiygHxJPsd
