[Corpora-List] SQL -- thanks, and preliminary results on tagging a 20m word corpus
Mark_Davies at byu.edu
Fri Jul 13 01:42:51 CEST 2007
Thanks to all of you who sent suggestions on the SQL statements. Piecing
together bits from all of the suggestions, I was able to create and
update the necessary tables.
I used the data from a 20 million word tagged corpus of Spanish to
create 1, 2, and 3-gram tables (words/POS/lemma for each "slot"), and
then ran queries to match these up with 2-grams, 3-grams, etc in the
untagged corpus. I first used the 3-grams table (tagging the middle
word, with one word of context on each side), then the still untagged
2-grams (one word of context to the left, and then to the right), etc. I
was able to tag the 20m word corpus in about 20 minutes total, once the
n-grams tables were set up.
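The post doesn't include the actual SQL, so here is a minimal sketch of the idea using Python's sqlite3, with a hypothetical schema (table and column names are my own assumptions): a trigram table built from the tagged corpus, and a single UPDATE with a correlated subquery that tags each untagged token from one word of context on each side.

```python
import sqlite3

# Hypothetical schema mirroring the approach described in the post:
# a trigram table from the tagged corpus, and a table of untagged
# tokens with one word of left/right context.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, pos2 TEXT, freq INT);
CREATE TABLE tokens (id INTEGER PRIMARY KEY, prev TEXT, word TEXT,
                     next TEXT, pos TEXT);
""")

# Toy data for the ambiguous Spanish word "para"
# (preposition vs. form of the verb "parar").
con.executemany("INSERT INTO trigrams VALUES (?,?,?,?,?)", [
    ("es", "para", "ti", "PREP", 50),
    ("se", "para", "en", "VERB", 10),
])
con.executemany("INSERT INTO tokens (prev, word, next) VALUES (?,?,?)", [
    ("es", "para", "ti"),
    ("se", "para", "en"),
])

# One UPDATE joins each untagged token against the trigram table,
# tagging the middle word; ties go to the most frequent trigram.
con.execute("""
UPDATE tokens
SET pos = (SELECT pos2 FROM trigrams t
           WHERE t.w1 = tokens.prev
             AND t.w2 = tokens.word
             AND t.w3 = tokens.next
           ORDER BY t.freq DESC LIMIT 1)
WHERE pos IS NULL;
""")

print(con.execute("SELECT word, pos FROM tokens ORDER BY id").fetchall())
```

A second pass of the same shape against a 2-gram table (matching only `prev, word` or `word, next`) would then fill in the tokens the trigram pass left untagged, which is the backoff order described above.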
It seemed to work quite well -- at least at first glance, judging from
15-20 particularly problematic words in Spanish. While this approach
certainly wouldn't be the last step in tagging a text, it may be a way
to get things in shape for more sophisticated (and probably slower)
taggers.
I've got an n-grams relational database with the same kind of info for
the 100m word British National Corpus (word+POS; the same data used for
http://corpus.byu.edu/bnc), and may try this for English as well, by
applying the BNC data to an untagged corpus of English.
In summary, while there are certainly many approaches to tagging a
corpus, this relational database approach appears to have some merit as
a fast first pass.
Thanks again for the input.
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **