Along these lines, while there are some large XML-based corpora out there, most of them seem to end up using some type of large-scale hashing (or something of the sort) to make the corpus usable and searchable. In other words, for these large corpora, I don't think anyone really searches the raw XML files themselves. There are some really fancy (and "elegant") XML architectures out there, but in the real world there seems to be a serious problem with scalability; hence the hybrid approach.
Most really large corpora that I'm aware of do use a relational database architecture, including systems like IMS Corpus Workbench. With corpora that I've created (see http://corpus.byu.edu), I've gone from 100 million words to 360+ million words and have seen little if any performance hit -- it's still three seconds or less for most queries, including queries that involve word form, lemma, part of speech, synonyms, limiting by and comparing across genres, etc. The relational database architecture really is quite scalable.
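The relational approach described above can be sketched in miniature: one row per token, with indexed columns for word form, lemma, part-of-speech tag, and genre, so that combined queries like the ones mentioned stay fast as the corpus grows. This is only an illustrative sketch -- the table layout, tag set, and sample data below are invented for the demo, not taken from corpus.byu.edu:

```python
# Minimal sketch of a token-per-row relational corpus schema,
# using SQLite for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE tokens (
        offset INTEGER,   -- token position in the corpus
        word   TEXT,      -- surface form
        lemma  TEXT,      -- dictionary form
        tag    TEXT,      -- part-of-speech tag (invented tag set)
        genre  TEXT       -- text category, for comparisons across genres
    )
""")
rows = [
    (1, "ran",  "run", "VVD", "fiction"),
    (2, "runs", "run", "VVZ", "news"),
    (3, "cat",  "cat", "NN1", "fiction"),
]
cur.executemany("INSERT INTO tokens VALUES (?,?,?,?,?)", rows)
# Indexes on lemma and tag are what keep lookups fast at scale.
cur.execute("CREATE INDEX idx_lemma ON tokens(lemma)")
cur.execute("CREATE INDEX idx_tag ON tokens(tag)")

# "All verb forms of the lemma 'run', counted per genre":
cur.execute("""
    SELECT genre, COUNT(*) FROM tokens
    WHERE lemma = 'run' AND tag LIKE 'VV%'
    GROUP BY genre ORDER BY genre
""")
result = cur.fetchall()
print(result)  # → [('fiction', 1), ('news', 1)]
```

With a real multi-hundred-million-word corpus the same schema works; the indexes (and, in practice, integer IDs in place of the strings) are what make sub-second queries possible.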
Just my .02 worth.
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of fatima zuhra
Sent: Saturday, April 26, 2008 10:56 PM
To: Hardie, Andrew
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Corpus Development
Thank you for your e-mail with its valuable suggestions. I will indeed act on your advice to enhance the corpus. I have been working with Xaira for a few days, and I have found it a very useful tool.
Well Sir, I would like to ask: what factors led you to prefer SQL for larger corpora, e.g. in the case of Urdu, Nepali, etc.? Isn't XML better for larger corpora? If not, then why, Sir?
"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote: Dear Fatima,
I am sure others will have responded to your queries, but I thought I'd add my voice. For the kind of data you describe, Xaira is indeed a good option. The web addresses you need are:
http://www.oucs.ox.ac.uk/rts/xaira/ http://www.natcorp.ox.ac.uk/tools/ http://sourceforge.net/projects/xaira/ http://xaira.sourceforge.net/
However, when you have a larger corpus, you might also consider whether a web-accessible solution (e.g. one based on an SQL database) would be more convenient. I have found this to be the case when working with corpora of Urdu, Nepali, Sinhala etc.
In terms of your future research, I would recommend working primarily on expanding your corpus. 30,000 words is not a lot of data in corpus terms. You will find, I think, that effort spent enhancing your corpus collection will be much more fruitful than developing software, especially given how much ready-made corpus analysis software is freely available.
Andrew Hardie
Linguistics & English Language
Bowland College
Lancaster University
Lancaster LA1 4YT
United Kingdom
________________________________________
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of fatima zuhra
Sent: 19 April 2008 03:25
To: Corpora at uib.no
Subject: [Corpora-List] Corpus Development

Hi All,
Thanks a lot to all, who paid attention to my message and provided me with their valuable suggestions.
Dear Laxmi, my corpus is a general-purpose corpus of written Pashto. Dear Mr. Adam, the corpus currently contains 30,000 words and its size is increasing. I haven't used Xaira, but am interested in using it. Dear Lou, I would be very thankful if you could help me further by forwarding some guidelines about Xaira. The web page http://www.xaira.net/ cannot be displayed in my browser.
Dear Gee Raza, I am also glad to see someone from Pakistan on the list. Well, I only know the three languages you have mentioned, but am interested in learning Arabic and Persian. I hope I'll soon learn these two.
Dear Oliver, I meant to ask whether I am going in the right direction for a general-purpose Pashto corpus. By "fully functional", I mean something that can rightly be called a corpus. I also wanted to investigate the appropriate statistical measures that can be used to evaluate any newly developed software. In our country there are statisticians who know each and every statistical measure, but they cannot guide us on which one to use for which purpose; and if there are some who can, we do not have access to them.
Thanks to Sir Ramesh for his encouragement and valuable suggestions.
I have also developed a finite state morphological analyzer for Pashto. I will provide the details from time to time.
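For readers unfamiliar with the finite-state approach to morphology, here is a toy sketch of the general idea: a small transducer-style analyzer that strips a known suffix and emits a lemma plus feature tags. The suffix table and stem below are invented for illustration only -- they are not real Pashto data and have nothing to do with the analyzer mentioned above, whose details were not posted:

```python
# Toy finite-state-style morphological analyzer: each suffix entry acts
# like a path through a transducer, mapping surface form -> lemma+features.
# The suffixes and features here are hypothetical, for illustration only.
SUFFIXES = {
    "una": "+Noun+Pl",   # hypothetical plural suffix
    "a":   "+Noun+Fem",  # hypothetical feminine ending
    "":    "+Noun+Sg",   # bare stem reading
}

def analyze(word, lexicon):
    """Return every (lemma + features) reading licensed by the lexicon."""
    readings = []
    for suffix, feats in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)] if suffix else word
        if stem in lexicon:  # only analyses with a known stem survive
            readings.append(stem + feats)
    return readings

lexicon = {"kitab"}  # one-entry stem lexicon for the demo
print(analyze("kitabuna", lexicon))  # → ['kitab+Noun+Pl']
```

A production analyzer would compile the lexicon and affix rules into a single finite-state transducer (e.g. with toolkits such as HFST or Foma) rather than looping over a suffix table, but the input/output behaviour is the same in spirit.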
Regards.