[Corpora-List] Announcement: Release of the DependencyTreebankDatabase DTDB 1.0

tufis at racai.ro tufis at racai.ro
Wed Feb 6 17:59:57 CET 2008

Dear Olga,

At http://corp.hum.sdu.dk, Eckhard Bick has created a great grammatically annotated Romanian corpus, along with similarly annotated corpora for various other languages.

The Romanian corpus covers the business language domain and has a size of 21.4 million words (27 million tokens). It was compiled by Arina Greavu (arinagreavu at yahoo.com) from news text sources, and annotated with (a) PoS and morphology using our tagger, and (b) syntactic function and shallow dependency markers using a Constraint Grammar system at VISL (http://beta.visl.sdu.dk/constraint_grammar.html). You can get further information from Eckhard (eckhard.bick at mail.dk).

Best regards,



From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Olga Pustylnikov
Sent: 5 February 2008 10:49
To: corpora at uib.no
Subject: Re: [Corpora-List] Announcement: Release of the DependencyTreebankDatabase DTDB 1.0

Dear Sabine,

Thank you for your response and for your helpful hints. Our main goals behind the conversion into eGXL were:
1. to unify the treebanks by means of XML;
2. to integrate the treebanks into a corpus management system in which not only treebanks but also spoken and web corpora are stored and retrieved by means of GXL (an XML-based graph representation format).

Restricting yourself to one specific format always means additional adaptations of your application whenever you have to deal with a new treebank. We therefore tried to select a format that is generic enough to be reused and that is also suitable for treebanks. GXL is a generic graph model that can represent any kind of corpus, since any sort of relation can be expressed in terms of a graph. That makes GXL a useful means for corpus retrieval. Treebanks can easily be mapped to it, since trees are special cases of graphs. eGXL slightly modifies GXL in order to account for the specifics of treebanks. We selected this format because it meets both requirements: it is generic and it is suitable for treebanks.
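[Editorial illustration, not part of the original message.] The mapping from a dependency tree to a graph can be sketched as follows. This is a minimal sketch that emits plain GXL elements (`gxl`, `graph`, `node`, `edge`, `attr`, `string`); it does not reproduce the eGXL modifications, which are not described in this thread. The function names, the `s1_n1`-style identifiers, the attribute names `form` and `deprel`, and the token tuple layout are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def _string_attr(parent, name, value):
    """Attach a GXL <attr name=...><string>value</string></attr> to parent."""
    attr = ET.SubElement(parent, "attr", {"name": name})
    ET.SubElement(attr, "string").text = value


def tree_to_gxl(sentence_id, tokens):
    """Build a plain-GXL graph for one dependency tree.

    tokens: iterable of (token_id, form, head_id, deprel) tuples, where
    head_id == 0 marks the root (hypothetical layout for this sketch).
    """
    tokens = list(tokens)
    gxl = ET.Element("gxl")
    graph = ET.SubElement(
        gxl, "graph", {"id": sentence_id, "edgemode": "directed"}
    )

    # One node per token, carrying the word form as a string attribute.
    for tok_id, form, _head, _deprel in tokens:
        node = ET.SubElement(graph, "node", {"id": f"{sentence_id}_n{tok_id}"})
        _string_attr(node, "form", form)

    # One directed edge per dependency, pointing from head to dependent.
    for tok_id, _form, head, deprel in tokens:
        if head == 0:
            continue  # the root has no incoming edge
        edge = ET.SubElement(
            graph,
            "edge",
            {"from": f"{sentence_id}_n{head}", "to": f"{sentence_id}_n{tok_id}"},
        )
        _string_attr(edge, "deprel", deprel)

    return gxl


if __name__ == "__main__":
    # Invented two-token sentence, not taken from any treebank.
    doc = tree_to_gxl("s1", [(1, "Dogs", 2, "SBJ"), (2, "bark", 0, "ROOT")])
    print(ET.tostring(doc, encoding="unicode"))
```

Because every tree is a graph, nothing in the node and edge representation is specific to dependency structures; the same elements could in principle hold other kinds of relations.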

In my paper I don't provide a detailed comparison of eGXL with other formats. However, CoNLL is referred to, although only indirectly, where the treebanks are compared. Please send me a reference to your work, which I failed to mention in this paper, and I will consider it in my future work.

Best regards,

On Feb 3, 2008 1:39 PM, Sabine Buchholz <sabine.buchholz at crl.toshiba.co.uk> wrote:

Dear Olga,

I think uniform formats for treebanks are a good idea and therefore read your announcement, Wiki page and article with interest. However, that raised a lot of questions. You clearly are aware of the CoNLL-X shared task on multilingual dependency parsing, as you link to its home page from your Wiki. For that task, 13 treebanks were converted to a uniform format, many of them among the 11 you list. Our goal was probably different from yours, but:
1) Why is that work not even mentioned in the paper, let alone compared with yours?
2) What part of the analyses you did for the paper could you not have done using the CoNLL-X format? You even seem to have used the CoNLL-X version of some treebanks (e.g. Dutch) as the basis of your eGXL conversion (the Dutch example in your paper is in CoNLL-X and not the original Alpino format).
3) Why did you choose to do that? The conversion from Alpino to CoNLL-X format loses some information, so why not convert from the original format? The same potentially applies to Swedish and Bulgarian.
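[Editorial illustration, not part of the original message.] For readers unfamiliar with the CoNLL-X format discussed above: each token is one line of ten tab-separated fields (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and sentences are separated by blank lines. The reader below is a minimal sketch under that assumption; the function name `parse_conllx` and the sample sentence are invented for the example and are not taken from any of the treebanks mentioned.

```python
CONLLX_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_conllx(text):
    """Parse CoNLL-X formatted text into a list of sentences.

    Each sentence is a list of dicts keyed by CONLLX_COLUMNS; the "id" and
    "head" fields are converted to int, everything else stays a string.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
        token = dict(zip(CONLLX_COLUMNS, fields))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        current.append(token)
    if current:
        sentences.append(current)  # file may lack a trailing blank line
    return sentences


if __name__ == "__main__":
    # Invented sample; "_" marks an unspecified field.
    sample = (
        "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n"
        "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n"
        "\n"
    )
    for sentence in parse_conllx(sample):
        for tok in sentence:
            print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

Only the HEAD and DEPREL columns carry the dependency structure here, which is one way to see why a conversion from a richer source format into this layout can lose information.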

With regard to your question about other treebanks to add to your database: in addition to the remainder of the 13 CoNLL-X treebanks and the new ones converted for the successor (the CoNLL 2007 shared task on dependency parsing), http://en.wikipedia.org/wiki/Treebank lists even more treebanks. But you probably already know that, since you link to it from your Wiki... Although I just noticed that the Romanian treebank you used is still missing from that list...

Looking forward to hearing from you, kind regards, Sabine Buchholz

----- Original Message -----
From: Olga Pustylnikov
To: corpora at uib.no
Sent: Friday, February 01, 2008 9:31 AM
Subject: [Corpora-List] Announcement: Release of the Dependency TreebankDatabase DTDB 1.0

Dear list members,

I'm happy to announce the release of DTDB 1.0, a Dependency Treebank DataBase. The database consists of treebanks for 11 languages, which have been transformed into a single representation format. This format is an XML-based graph model, and it was designed to support the interoperability of existing corpora. The wiki http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/ presents the treebanks and the unification format used. Details about the format are also described in: http://ariadne.coli.uni-bielefeld.de/pustylnikov/pdfs/acl07.1.0.pdf

My question is: do other treebanks exist that are not yet part of the database? If you know of an existing treebank that should be transformed into the unified format, please let me know.

-- Olga Pustylnikov

Universität Bielefeld Fakultät für Linguistik und Literaturwissenschaft Universitätsstraße 25 D-33615 Bielefeld

http://ariadne.coli.uni-bielefeld.de/pustylnikov/ olga.pustylnikov at uni-bielefeld.de
