[Corpora-List] New BNC-related corpus: register-based queries and "fuzzy matches"
Mark_Davies at byu.edu
Mon Nov 22 14:19:05 CET 2004
I have placed on the web a freely-accessible resource that may be of interest to some of you:
([V]ariation [I]n [E]nglish [W]ords and Phrases)
As with some other interfaces, this website allows you to quickly and easily search the 100-million-word British National Corpus. Users can search by exact word or phrase, wildcard or part of speech, or combinations of these (e.g. all nouns ending in -ness or all cases of "white" + [noun]).
Unlike some interfaces that are strictly "slot-oriented", this interface also allows you to use "anchors" and "targets" for fuzzy matches (e.g. all nouns somewhere near "break" (v), adjectives near "woman", verbs near "way", and nouns near "small"), and the size of the window can be easily customized.
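For readers who want a concrete picture of an anchor/target query, here is a minimal Python sketch. It is an illustration only, not the website's actual implementation: the function name `collocates`, the toy token list, and the simplified part-of-speech labels are all assumptions (the BNC itself uses a much richer tagset).

```python
from collections import Counter

# Toy tagged corpus of (word, part-of-speech) pairs. The words and the
# simplified tags are invented for illustration; they are not BNC data.
TOKENS = [
    ("they", "PRON"), ("break", "VERB"), ("the", "DET"), ("law", "NOUN"),
    ("and", "CONJ"), ("we", "PRON"), ("break", "VERB"), ("the", "DET"),
    ("strict", "ADJ"), ("law", "NOUN"), ("after", "PREP"), ("a", "DET"),
    ("coffee", "NOUN"), ("break", "NOUN"),
]

def collocates(tokens, anchor, anchor_pos, target_pos, window=4):
    """Count words tagged target_pos within `window` tokens of the anchor.

    The anchor must match both the word form and its part of speech, so
    "break" (v) and "break" (n) are kept apart.
    """
    counts = Counter()
    for i, (word, pos) in enumerate(tokens):
        if word != anchor or pos != anchor_pos:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j][1] == target_pos:
                counts[tokens[j][0]] += 1
    return counts

# Nouns near "break" (v), first with a narrow window, then a wider one.
narrow = collocates(TOKENS, "break", "VERB", "NOUN", window=2)
wide = collocates(TOKENS, "break", "VERB", "NOUN", window=3)
print(narrow, wide)
```

Widening the window lets the same target word be picked up from both sides of an anchor, which is what distinguishes this kind of query from a fixed-slot pattern such as "white" + [noun].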
Perhaps the most distinctive aspect of the corpus is the ability to find the frequency of words and phrases in any combination of registers that you define (spoken, academic, poetry, medical, etc.). In addition, you can compare between registers -- for example, verbs that are more common in legal or medical texts, phrases like [I * that] that are more common in conversation than in non-fiction texts, nouns near "break" (v) that are found primarily in academic writing, etc.
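One common way to make such a register comparison fair is to normalize raw counts by the size of each register before comparing them. The sketch below assumes that approach; it is not necessarily how the website computes its results, and the register sizes, counts, and function names are invented for illustration.

```python
def per_million(count, register_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / register_size

def compare_registers(counts, sizes, reg_a, reg_b):
    """Rank words by how much more frequent they are in reg_a than reg_b.

    Returns (word, freq_a, freq_b, ratio) tuples, highest ratio first.
    One is added to both normalized frequencies so that a word absent
    from reg_b does not cause a division by zero.
    """
    rows = []
    for word, by_register in counts.items():
        freq_a = per_million(by_register.get(reg_a, 0), sizes[reg_a])
        freq_b = per_million(by_register.get(reg_b, 0), sizes[reg_b])
        ratio = (freq_a + 1) / (freq_b + 1)
        rows.append((word, freq_a, freq_b, ratio))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Invented register sizes (in words) and raw counts, not real BNC figures.
SIZES = {"legal": 2_000_000, "conversation": 4_000_000}
COUNTS = {
    "stipulate": {"legal": 200, "conversation": 4},
    "reckon": {"legal": 2, "conversation": 800},
}

rows = compare_registers(COUNTS, SIZES, "legal", "conversation")
for word, freq_a, freq_b, ratio in rows:
    print(f"{word:10s} legal={freq_a:.1f} conversation={freq_b:.1f} ratio={ratio:.2f}")
```

Swapping the two register arguments gives the reverse ranking, i.e. the words that are characteristic of conversation rather than legal texts.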
Finally, it should be noted that the database architecture of this corpus improves on some previous interfaces, in that it allows users to find *all* of the matching strings from the BNC, rather than just those n-grams that occur three times or more in the corpus (which effectively cuts out about 75% of all 2-gram and 3-gram strings). It's also quite fast -- just a couple of seconds or less for nearly all searches -- including queries with detailed register information.
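The effect of a frequency cutoff on n-gram coverage can be seen even on a tiny example. The following Python sketch is a toy illustration under stated assumptions: the helper names and the sample sentence are made up, "strings" is interpreted as distinct n-gram types, and the result says nothing about the actual proportion in the BNC.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def share_below_cutoff(counts, min_freq=3):
    """Fraction of distinct n-grams that a frequency cutoff would discard."""
    if not counts:
        return 0.0
    dropped = sum(1 for c in counts.values() if c < min_freq)
    return dropped / len(counts)

# Invented sample text; a real corpus has a far longer tail of rare n-grams.
tokens = "in the end in the end in the end".split()
bigrams = ngram_counts(tokens, 2)

print(bigrams)
print(share_below_cutoff(bigrams, min_freq=3))
```

An index that stores only n-grams at or above the cutoff cannot return the discarded strings at all, whereas one that keeps every n-gram can still apply a minimum frequency at query time if the user wants one.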
If you have any questions, please feel free to email me.
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
** Corpus design and use // Web-database scripting **
** Historical linguistics // Functional-typological grammar **
** Variation in Spanish, Portuguese, and English syntax **