[Corpora-List] English corpus for specific domains (coming soon....)

Mark Davies Mark_Davies at byu.edu
Thu Nov 13 22:37:02 CET 2014


Liling,


>> Are there corpora that are specifically for the following domains:

In about 5-6 weeks I'll be releasing a corpus that is based on the 2 billion words (4.5 million articles) in Wikipedia, which should do most of what you want. Via the web interface, you'll be able to quickly and easily create "virtual corpora" from the 4.5 million articles, based on titles, page links, and/or page content. Each of these virtual, personalized corpora can have up to 1,000 articles and 1.2 million words.

You'll then be able to search within these virtual corpora (strings, n-grams, collocates, collocations, concordances, etc.), compare word and phrase frequencies across your virtual corpora, or find keywords (including multi-word expressions) in your corpora, all from within the web interface and all within just a few seconds.
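To make the idea concrete, here is a rough Python sketch of what a "virtual corpus" and a keyword comparison could look like. It is only an illustration under stated assumptions, not the actual web interface or how it works internally: the `Article` class and the functions `build_virtual_corpus` and `keywords` are hypothetical names, tokenization is plain whitespace splitting, and the smoothed frequency ratio is just one simple way to rank keywords.

```python
from collections import Counter
from dataclasses import dataclass, field

# Caps taken from the description above; everything else is assumed.
MAX_ARTICLES = 1_000
MAX_WORDS = 1_200_000


@dataclass
class Article:
    """Hypothetical stand-in for one Wikipedia article."""
    title: str
    text: str
    links: list = field(default_factory=list)


def tokenize(text):
    # Naive whitespace tokenizer, for illustration only.
    return text.lower().split()


def build_virtual_corpus(articles, title_word=None, content_word=None, link=None):
    """Select articles by title, content, and/or page link, within the caps."""
    selected, total_words = [], 0
    for art in articles:
        if title_word and title_word.lower() not in art.title.lower():
            continue
        if content_word and content_word.lower() not in tokenize(art.text):
            continue
        if link and link not in art.links:
            continue
        n_words = len(tokenize(art.text))
        if len(selected) >= MAX_ARTICLES or total_words + n_words > MAX_WORDS:
            break
        selected.append(art)
        total_words += n_words
    return selected


def word_frequencies(corpus):
    counts = Counter()
    for art in corpus:
        counts.update(tokenize(art.text))
    return counts


def keywords(target, reference, top=5):
    """Rank words of `target` by add-one smoothed relative-frequency ratio."""
    t, r = word_frequencies(target), word_frequencies(reference)
    t_total, r_total = sum(t.values()), sum(r.values())
    scores = {
        w: ((c + 1) / (t_total + 1)) / ((r[w] + 1) / (r_total + 1))
        for w, c in t.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Tiny made-up sample, using terms from the original question.
articles = [
    Article("Lemonade", "lemonade is a sweet lemon drink", ["Lemon"]),
    Article("Butyl rubber", "butyl rubber is a synthetic rubber", ["Polymer"]),
    Article("Jacket potato", "a jacket potato is a baked potato", ["Potato"]),
]
food = build_virtual_corpus(articles, content_word="potato")
chem = build_virtual_corpus(articles, title_word="rubber")
top_food_words = keywords(food, chem, top=1)
```

Calling `keywords(food, chem)` ranks the words that are relatively more frequent in the first virtual corpus than in the second, which is the kind of cross-corpus comparison described above.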

Anyway, the corpus (and interface) is essentially done now; I'm just finishing the help files, including some tutorials that I'll place on YouTube.

So this may be of interest to you when I release it in just a few weeks.

Best,

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________
From: corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of liling tan <alvations at gmail.com>
Sent: Thursday, November 13, 2014 7:33 AM
To: corpora at uib.no
Subject: [Corpora-List] English corpus for specific domains

Dear linguists,

Traditional corpora such as the British National Corpus, the Corpus of Contemporary American English (COCA) and the International Corpus of English hold on to the notion of a balanced corpus and include texts of different registers, domains and types.

Web corpora such as Wikipedia corpora, Web-as-Corpus corpora and many others are compiled by crawling or by crowdsourcing texts, and they also end up as some sort of balanced corpora.

Finding corpora for specific domains therefore takes some resourcefulness, and we would appreciate your help in locating them.

Are there corpora that are specifically for the following domains:

* Chemical: the taxonomy rooted at "chemical"; examples of terminology concepts are ("ammonium carbonate", "beta hydroxybutyric acid", "butyl rubber");

* Equipment: the taxonomy rooted at "equipment"; examples of terminology concepts are ("acoustic modem", "parasail", "clock pendulum");

* Food: the taxonomy rooted at "food"; examples of terminology concepts are ("jacket potato", "lemonade", "bolognese pasta sauce");

* Science: the taxonomy rooted at "science"; examples of terminology concepts are ("neuropsychiatry", "craniometry", "microelectronics").

Best Regards,
Liling


