[Corpora-List] Announcement: Release of the Dependency TreebankDatabase DTDB 1.0

Olga Pustylnikov olga.pustylnikov at uni-bielefeld.de
Tue Feb 5 09:49:08 CET 2008


Dear Sabine,

Thank you for your response and your helpful hints. Our main goals in converting to eGXL were: 1. to unify the treebanks by means of XML, and 2. to integrate them into a corpus management system in which not only treebanks but also spoken and web corpora are stored and retrieved by means of GXL (an XML-based graph representation format).

Restricting yourself to one specific format always means adapting your application whenever you have to deal with a new treebank. We therefore tried to select a format that is generic enough to be reused and also suitable for treebanks. GXL is a generic graph model that can represent any kind of corpus, since any sort of relation can be expressed in terms of a graph. That makes GXL a useful means for corpus retrieval. Treebanks can easily be mapped to it (since trees are special cases of graphs). eGXL slightly modifies GXL in order to account for the specifics of treebanks. We selected this format because it meets both requirements: it is generic and it is suitable for treebanks.
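As a rough illustration of the tree-to-graph mapping described above, the following Python sketch turns one dependency-annotated sentence into a plain GXL graph: each token becomes a node and each head-dependent relation becomes a directed edge. It is only a sketch under stated assumptions. The input is assumed to be in the ten-column CoNLL-X layout, the example sentence, the function names and the attribute names ("form", "lemma", "postag", "deprel") are hypothetical, and the output uses standard GXL elements (gxl, graph, node, edge, attr, string) rather than the eGXL extensions, which are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-token sentence in CoNLL-X column order:
# ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
CONLL_SENTENCE = (
    "1\tShe\tshe\tPRON\tPRON\t_\t2\tSUBJ\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\tVERB\t_\t0\tROOT\t_\t_\n"
    "3\t.\t.\tPUNCT\tPUNCT\t_\t2\tPUNCT\t_\t_\n"
)


def add_string_attr(parent, name, value):
    """Attach <attr name="..."><string>value</string></attr> to parent."""
    attr = ET.SubElement(parent, "attr", name=name)
    ET.SubElement(attr, "string").text = value


def conll_to_gxl(conll_text, graph_id="s1"):
    """Build a GXL element tree for one sentence (assumed attribute names)."""
    gxl = ET.Element("gxl")
    graph = ET.SubElement(gxl, "graph", id=graph_id, edgemode="directed")
    rows = [line.split("\t") for line in conll_text.strip().splitlines()]

    # One node per token, carrying word form, lemma and POS tag.
    for cols in rows:
        node = ET.SubElement(graph, "node", id=f"{graph_id}_t{cols[0]}")
        add_string_attr(node, "form", cols[1])
        add_string_attr(node, "lemma", cols[2])
        add_string_attr(node, "postag", cols[4])

    # One directed edge per dependency, from head to dependent.
    # HEAD == 0 marks the root token, which has no incoming edge.
    for cols in rows:
        token_id, head, deprel = cols[0], cols[6], cols[7]
        if head == "0":
            continue
        edge = ET.SubElement(
            graph,
            "edge",
            {"from": f"{graph_id}_t{head}", "to": f"{graph_id}_t{token_id}"},
        )
        add_string_attr(edge, "deprel", deprel)
    return gxl


if __name__ == "__main__":
    print(ET.tostring(conll_to_gxl(CONLL_SENTENCE), encoding="unicode"))
```

Because nothing in the node and edge structure is specific to trees, the same representation could in principle hold other corpus types as graphs, which is the point made above about genericity.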

In my paper I do not provide a detailed comparison of eGXL with other formats. However, CoNLL is referred to, although only indirectly, in the comparison of the treebanks. Please send me a reference to your work that I failed to mention in this paper, and I will consider it in my future work.

Best regards,

On Feb 3, 2008 1:39 PM, Sabine Buchholz <sabine.buchholz at crl.toshiba.co.uk> wrote:


> Dear Olga,
> I think uniform formats for treebanks are a good idea and therefore read
> your announcement, Wiki page and article with interest. However, that
> raised a lot of questions:
> You clearly are aware of the CoNLL-X shared task on multilingual
> dependency parsing, as you link to its home page from your Wiki. For that
> task 13 treebanks were converted to a uniform format, many of them among
> the 11 you list. Our goal was probably different from yours but
> 1) Why is that work not even mentioned in the paper, let alone compared
> to?
> 2) What part of the analyses you did for the paper could you not have done
> using the CoNLL-X format?
> You even seem to have used the CoNLL-X version of some treebanks (e.g.
> Dutch) as the basis of your eGXL conversion (the Dutch example in your
> paper is in CoNLL-X and not the original Alpino format).
> 3) Why did you choose to do that? The conversion from Alpino to CoNLL-X
> format loses some information, so why not convert from the original
> format?
> Same potentially for Swedish and Bulgarian.
>
> With regard to your question about other treebanks to add to your
> database: in addition to the remainder of the 13 CoNLL-X treebanks and the
> new ones converted for the successor (the CoNLL 2007 shared task on
> dependency parsing), http://en.wikipedia.org/wiki/Treebank lists even more
> treebanks. But you probably already know that, you link to it from your
> Wiki... Although I just noticed that the Romanian treebank you used is
> still missing from that list...
>
> Looking forward to hearing from you,
> kind regards,
> Sabine Buchholz
>
>
> ----- Original Message -----
> From: Olga Pustylnikov
> To: corpora at uib.no
> Sent: Friday, February 01, 2008 9:31 AM
> Subject: [Corpora-List] Announcement: Release of the Dependency TreebankDatabase DTDB 1.0
>
> Dear list members,
> I'm happy to announce the release of DTDB 1.0, a Dependency Treebank
> DataBase. The database consists of 11 languages which are transformed into
> a single representation format. This format is an XML based graph model,
> and it was designed to support the interoperability of existing corpora.
> The wiki http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/ presents
> the treebanks and the unification format used. Details about the format
> are also described in:
> http://ariadne.coli.uni-bielefeld.de/pustylnikov/pdfs/acl07.1.0.pdf
> My question is: do other treebanks exist which are not part of the
> database?
> If you know of an existing treebank that should be transformed into the
> unified format, please let me know.
>
> --
> Olga Pustylnikov
>
> Universität Bielefeld
> Fakultät für Linguistik und Literaturwissenschaft
> Universitätsstraße 25
> D-33615 Bielefeld
>
> http://ariadne.coli.uni-bielefeld.de/pustylnikov/
> olga.pustylnikov at uni-bielefeld.de

--
Olga Pustylnikov

Universität Bielefeld
Fakultät für Linguistik und Literaturwissenschaft
Universitätsstraße 25
D-33615 Bielefeld

http://ariadne.coli.uni-bielefeld.de/pustylnikov/
olga.pustylnikov at uni-bielefeld.de


