I am leading a project to build a text corpus of medieval Norwegian. The project falls under the Menota umbrella (Medieval Nordic Text Archive, www.menota.org), and the texts are encoded in the Menota extension of TEI P5 (see www.menota.org for the Menota handbook).
The corpus will consist of 1.5 million running words (which is a lot when transcribed from manuscripts rather than from editions). Of these, 1.0 million will be given a morphosyntactic encoding, and 0.5 million of those will also be encoded as syntactic trees (a treebank). The treebank XML format will follow the University of Stuttgart's TIGER format.
In Menota (as in all corpora I have been involved in developing), the Corpus Workbench (CWB/CQP) from the University of Stuttgart is the standard choice of corpus search system. However, CWB/CQP is old and has only been maintained, not developed, over the last 10 years (I know about the Open Corpus Workbench initiative). For example, the Unicode support is meager.
Do you have any suggestions for a more up-to-date system, e.g. one with full Unicode support? Could Lucene be a candidate?
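To make the Unicode requirement concrete, here is a minimal sketch in Python of the kind of behaviour I would expect from a search system. It assumes that the same medieval character can arrive in more than one encoding (here the o with ogonek, either precomposed as U+01EB or as o followed by the combining ogonek U+0328) and that a query should match both. The token list and the `search` helper are hypothetical and only illustrate the point; they are not part of Menota, CQP or Lucene.

```python
import unicodedata

# Two encodings of the same word: precomposed vs. combining ogonek.
precomposed = "l\u01ebnd"   # o with ogonek as a single code point (U+01EB)
decomposed = "lo\u0328nd"   # plain o followed by U+0328 COMBINING OGONEK


def normalize(token: str) -> str:
    """Bring a token to Unicode Normalization Form C before comparing."""
    return unicodedata.normalize("NFC", token)


def search(tokens, query):
    """Return the positions of all tokens equal to the query after NFC."""
    wanted = normalize(query)
    return [i for i, tok in enumerate(tokens) if normalize(tok) == wanted]


# Hypothetical token stream mixing both encodings.
tokens = ["ok", decomposed, "r\u00edki", precomposed]

# Without normalization the two spellings differ as raw strings.
print(precomposed == decomposed)

# With normalization both occurrences are found.
print(search(tokens, precomposed))
```

A system with full Unicode support would ideally apply this kind of normalization at both indexing and query time, so that the transcriber's choice of encoding does not affect the search results.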