The Brown Corpus

Leech, Geoffrey g.leech at lancaster.ac.uk
Mon Jun 20 12:40:01 CEST 2005


Anent Lou's remark, Nick Smith and I are also working on an XMLified version of Brown, together with comparable versions of Frown, LOB and FLOB. They will be tagged with the C8 tagset, an enriched version of the richer of the two tagsets used to tag the BNC (C7), and will be available through the OTA and also through the new ICAME Collection CD Knut is planning. The tagging has already been done, but (as always) there's still some tidying up to be done. The idea is to be able to make precise comparisons between the tagged versions of the four matching corpora. Christian Mair (Freiburg) and Marianne Hundt (Heidelberg) have been collaborating with us on this for a few years.

Re the issue of 500 2000-word text files vs the 15 text category files, of course it's easy enough to have and use the two versions as alternatives, as we've been doing with our students for some time.

Geoff Leech

-----Original Message-----
From: corpora-archive-admin at uib.no
[mailto:corpora-archive-admin at uib.no]On Behalf Of
corpora-archive-request at uib.no
Sent: 17 June 2005 14:01
To: corpora-archive at uib.no
Subject: Corpora-archive digest, Vol 1 #154 - 11 msgs



Today's Topics:

1. [Corpora-List] Another "Search Inside" tool: Google Print... (Ute Römer)
2. [Corpora-List] CfP: EUROLAN 2005 Workshop "ROMANCE FrameNet" (Vincenzo Pallotta)
3. [Corpora-List] work on corpus linguistics written in german (Scherer, Carmen)
4. [Corpora-List] 2nd CfP: Computer Treatment of Slavic and East European Languages
- Slovko 2005 (Alexander Horak)
5. [Corpora-List] Japanese/English aligned corpora? (Jim Breen)
6. [Corpora-List] Final CFP: ECCB'05 Workshop on Biomedical Ontologies and Text
Processing (George Demetriou)
7. [Corpora-List] Brown Corpus (Lou Burnard)
8. [Corpora-List] Brown Corpus (Adam Kilgarriff)
9. [Corpora-List] Brown Corpus (Jean Veronis)

--__--__--

Message: 1
From: Ute Römer <ute.roemer_AT_anglistik.uni-hannover.de>
To: "'David Oakey'" <d.j.oakey_AT_bham.ac.uk>, <CORPORA_AT_UIB.NO>
Subject: [Corpora-List] Another "Search Inside" tool: Google Print...
Date: Thu, 16 Jun 2005 13:40:40 +0200
Reply-To: corpora-archive_AT_uib.no

Dear all,

David's message (Thanks, David! I didn't know about the update) reminded me
of a related search tool which might also be of interest to some of you
(maybe you know about it already, but I only discovered it a few weeks ago):
Google Print (check http://print.google.com/ and
http://print.google.com/googleprint/about.html). The system allows you to
search the full text of a huge number of books (apparently, they collaborate
with publishers and libraries; they don't say how many books have been
scanned and uploaded so far though) and gives you selected pages from those
books which contain your search string. It's not so much a concordancing
facility as a new way of doing (literature) research.

A search for "corpus linguistics", for instance, retrieves 3,040 hits (with
the Biber/Conrad/Reppen 1998 textbook topping the list);
http://print.google.com/print?ie=UTF-8&q=%22corpus+linguistics%22&btnG=Search.
You can then follow a link and separately search within each of the
"corpus linguistics" books. For example, you find that "register variation"
occurs on 45 different pages in Biber/Conrad/Reppen 1998, and there are
links that take you to the scanned image of each of the relevant pages (with
the search item highlighted). That option is also very useful when you need
to check the page number of a quote and don't have the book at hand. You can
also see which library near you has this book -- and, of course, where you
can buy it.

Best wishes... Ute


********************************************

Ute Römer
English Department
University of Hanover
Königsworther Platz 1
30167 Hannover
Germany

Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
E-mail: ute.roemer_AT_anglistik.uni-hannover.de
http://www.uteroemer.de
http://www.fbls.uni-hannover.de/angli/


> -----Original Message-----
> From: owner-corpora_AT_lists.uib.no [mailto:owner-corpora_AT_lists.uib.no] On
> Behalf Of David Oakey
> Sent: Thursday, June 16, 2005 12:06 PM
> To: CORPORA_AT_UIB.NO
> Subject: [Corpora-List] Additions to amazon.com "Search Inside" feature
>
> Apologies if I'm reporting something that everyone already knows
> about except me, but Amazon.com's "Inside this book" feature now
> provides - for all books in its "Search Inside" scheme - a concordance
> (in the sense of a frequency list rather than KWIC citations), text
> statistics, and statistically improbable phrases (SIPs). A SIP works a
> bit like an n-gram version of a keyword in Wordsmith Tools, with the
> reference corpus being all the books in Amazon's "Search Inside" corpus.
> If Amazon finds "a phrase that occurs a large number of times in a
> particular book relative to all Search Inside books, that phrase is a
> SIP in that book." On the shopping page for the book "Into the void with
> Ace Frehley" (the notoriously spaced former guitarist in the rock band
> KISS), for example, the SIP they list is "black nail polish". This is
> impressive - and not at all improbable - if you know much about the
> career of Ace Frehley.
>
> The concordance results are presented alphabetically, with more frequent
> words shown in a larger font size. Text statistics include standard
> readability indices (the Fog Index seems apt here) and they have a "fun
> stats" section where they calculate words per dollar and words per ounce
> (words per pound and words per kilo on amazon.co.uk). More information
> on the Amazon site about the number of books in the scheme (yes, 120,000
> books, 33 million pages etc., but that was nearly 2 years ago), their
> subject areas, authorship details etc. would of course be useful. While
> this is intended as a marketing feature (it "allows you to search
> millions of pages to find exactly the book you want to buy"), I believe
> it would be interesting to corpora list members in itself.
>
> Best wishes,
>
> David Oakey
> ------------------------------
> Lecturer in English Language
> English for International Students Unit
> University of Birmingham, UK
> phone: +44 121 4145703
> email: d.j.oakey_AT_bham.ac.uk
> http://www.eisu.bham.ac.uk/staff/oakeydavid.htm
> ------------------------------
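Amazon does not publish its SIP formula, but the behaviour David describes -- phrases over-represented in one book relative to the whole "Search Inside" collection -- can be sketched as a relative-frequency ratio over n-grams. A minimal illustration in Python (the toy data and the add-one smoothing are my own inventions for the sketch, not Amazon's actual method):

```python
from collections import Counter

def ngrams(tokens, n=3):
    # All n-grams (as tuples) in a token list.
    return zip(*(tokens[i:] for i in range(n)))

def sip_candidates(book_tokens, reference_tokens, n=3, min_count=2):
    # Rank a book's n-grams by how much more frequent they are in the
    # book than in the reference collection. Add-one smoothing keeps
    # n-grams unseen in the reference from dividing by zero.
    book = Counter(ngrams(book_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    nb, nr = sum(book.values()), sum(ref.values())
    score = {g: (c / nb) / ((ref[g] + 1) / (nr + 1))
             for g, c in book.items() if c >= min_count}
    return sorted(score, key=score.get, reverse=True)

# Invented toy data: one "book" against a tiny reference collection.
book = "black nail polish on stage black nail polish backstage".split()
reference = ("the band played on stage and the crowd sang along " * 3).split()
top = sip_candidates(book, reference)[0]
```

On this toy data the top-ranked trigram is "black nail polish", matching the flavour of the example in David's message.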







--__--__--

Message: 2
Date: Thu, 16 Jun 2005 15:51:14 +0200
From: Vincenzo Pallotta <Vincenzo.Pallotta_AT_epfl.ch>
Organization: Swiss Federal Institute of Technology - Lausanne
To: corpora_AT_hd.uib.no
Subject: [Corpora-List] CfP: EUROLAN 2005 Workshop "ROMANCE FrameNet"
Reply-To: corpora-archive_AT_uib.no

*** CALL FOR PAPERS ***

ROMANCE FrameNet

Workshop and Kick-off Meeting
26 - 28 July, 2005


a satellite event of the EUROLAN 2005 Summer School
http://www.cs.ubbcluj.ro/EUROLAN2005


Babes-Bolyai University, Cluj-Napoca, Romania
25 July - 6 August, 2005



*** Background and Goals

ROMANCE FrameNet (http://ic2.epfl.ch/~pallotta/rfn) is a joint
initiative to create a special interest group with the goal of building
a multi-lingual FrameNet resource for Romance languages based on
Fillmore's Frame Semantics (1977, 1982). ROMANCE FrameNet differs
from Multi-Lingual WordNet in that it will link collocational and
constructional material to word senses (not only sub-categorization
frames of verbs). Applications are apparent in foreign language
learning, as well as in computer-assisted translation, multi-lingual
information extraction and cross-lingual question answering.

We consider the six major Romance languages, namely French, Spanish,
Italian, Romanian, Portuguese and Catalan, but we do not exclude other
Romance languages, whose representatives are equally invited to join the
ROMANCE FrameNet initiative.

We suggest adopting a new methodology for the development of ROMANCE
FrameNet: translating the sentences annotated in the original FrameNet
project from its support corpora (e.g. the BNC). This approach is inspired
by the recent work on MultiSemCor. The translation task will be distributed
among the participants and supervised by local teams in the
participating institutions working on the different languages.

ROMANCE FrameNet is directed towards the following goals:

1. creating a consistent aligned and frame-annotated multi-lingual
corpus;
2. highlighting cross-language regularities, and structural intra-
and extra-typological idiosyncrasies;
3. creating a semantically indexed translation memory and an inverse
multi-lingual dictionary;
4. creating one of the first freely available resources which contain
cross-language sub-categorization and collocational mappings;
5. reusing the work done on automatic role assignment and semantic
parsing (cf. Senseval-3).


*** Workshop and Kick-off Meeting

We propose to meet during the EUROLAN 2005 Summer School in Cluj-Napoca
in order to develop an informal agreement on a work plan for
bootstrapping the ROMANCE FrameNet resource.

As a first shared task for bootstrapping the project, we propose the
translation and annotation of a common subset of 110 sentences from the
English FrameNet data in each of the targeted languages. This work will
be borne by the participants in the workshop and kick-off meeting,
noting that there may be multiple translations of the same sentence.
We consider this redundancy very useful for evaluating
inter-translator agreement.

The mini-corpus will be the basis of the discussion during the kick-off
meeting and possibly of the workshop papers. It is available for
download at: http://ic2.epfl.ch/~pallotta/rfn/data.zip.

FrameNet annotations can be found by browsing the FrameNet data
(http://framenet.icsi.berkeley.edu/) locally or by using Sato's
FrameSQL browser
(http://sato.fm.senshu-u.ac.jp/fn22/notes/fullMenuFrame.html).

The ROMANCE FrameNet workshop and kick-off meeting will consist of
one hour of official public presentations and discussion per day (from
18.30 to 19.30) on three consecutive days (July 26, 27 and 28).

Additionally, we encourage spontaneous informal
gatherings and working groups involving participants (either within the
same language or across languages) throughout the EUROLAN school.


*** Paper Submission

We invite papers on various aspects of the construction of the ROMANCE
FrameNet resource. Topics of interest include but are not limited to:

* Creation of aligned multi-lingual corpus for ROMANCE FrameNet
* Transfer of Frames between English and Romance Languages
* Transfer and Adaptation of Frame annotations
* Cross-language similarities and differences in lexical choice for
translating the lexical units
* Cross-language similarities and differences in sub-categorization
and selectional restrictions
* Transfer of collocations and idiomatic expressions
* Automatic methods for alignment of multilingual resources in
perspective of Frame annotation
* Applications of multilingual FrameNet
* Evaluation (inter-translators agreement, annotation transfer, etc.)

Extended versions of a selection of the best papers will be published in
a special issue of the Romanian Journal of Information Science and
Technology, published by the Romanian Academy Publishing House (ISSN:
1453-8245). The issue will be printed as post-conference proceedings.


*** IMPORTANT DATES

SUBMISSION DEADLINE: 30th June 2005

Notification of Acceptance: 10th July 2005

Camera-ready Papers: 15th July 2005

Workshop: 26-28 July 2005


*** Paper Requirements

Authors are invited to submit a 6-10 page paper in electronic form (PDF
only) by 30th of June 2005. As with other EUROLAN workshops in past
years, the review process is not blind. Authors of
accepted papers should submit the final version in electronic format no
later than 15th of July. The final version must also be in PDF format.

For the papers selected for publication in the ROJIST journal, we
require a LaTeX file (all macros used should be included; both emTeX and
LaTeX2e are allowed; the standard "article" style is strongly
recommended). All illustrations must be of professional quality and
should be sent in separate files in BMP format. Abstract,
Introduction and Conclusion sections are required. References should be
listed in alphabetical order.

A sample paper is available here:
http://www.ceid.upatras.gr/Balkanet/journal/7_Overview.pdf

No galley proofs are sent to authors; the paper is printed in the final
form received from the authors after completion of the refereeing
process. For each published paper, 25 reprints are free of charge.

We suggest complying with the ROJIST guidelines from the beginning,
when submitting the workshop paper.
All papers should be sent to both Vincenzo Pallotta and Dan Tufis.


*** Registration

People attending the ROMANCE FrameNet workshop are warmly invited to
participate in the EUROLAN 2005 Summer School by registering here:
http://www.cs.ubbcluj.ro/EUROLAN2005/index.php?r=Registr (Note that
authors of papers accepted for presentation at the workshop will
benefit from the early registration fee regardless of the date they
register.)

Participation in the workshop is open to all EUROLAN 2005 attendees and
is included in the school's participation fee. Copies of the workshop
proceedings will be included in the EUROLAN school's CD-ROM.

Alternatively, it is possible to register only for the workshop, for a
fee of 50 euro to be paid at the workshop site.


*** Scientific Committee

Collin Baker (FrameNet, ICSI, Berkeley, USA)
Dan Cristea (University of Iasi, Romania)
Rodolfo Delmonte (University of Venice, Italy)
Charles J. Fillmore (FrameNet, ICSI, Berkeley, USA)
Thierry Fontenelle (Microsoft Research, USA)
Rada Mihalcea (University of North Texas, USA)
Vincenzo Pallotta (EPFL, Switzerland)
Carlos Subirats (Autonomous University of Barcelona, Spain)
Violeta Seretan (University of Geneva, Switzerland)
Amalia Todirascu (University of Strasbourg, France)
Dan Tufis (Romanian Academy, Bucharest, Romania)
Nancy Ide (Vassar College, USA)
Alessandro Lenci (University of Pisa, Italy)
Francesca Bertagna (University of Pisa, Italy)
Emanuele Pianta (ITC-IRST, Trento, Italy)
Berardo Magnini (ITC-IRST, Trento, Italy)
Pierrette Bouillon (University of Geneva, Switzerland)
Dominique Dutoit (Memodata, France)
Mercé Lorente (University of Barcelona, Spain)
Aline Villavicencio (University of Essex, UK)


*** Organization

The ROMANCE FrameNet Workshop and Kick-off Meeting is part of the
EUROLAN 2005 Summer School and is organized by:

Vincenzo Pallotta (EPFL, Lausanne, Switzerland)
Dan Tufis (Romanian Academy, Bucharest, Romania)

For further information please contact:

Vincenzo Pallotta: Vincenzo.Pallotta_AT_epfl.ch
Dan Tufis: tufis_AT_racai.ro

This document is also available on-line at:

http://ic2.epfl.ch/~pallotta/rfn/








--__--__--

Message: 3
Subject: [Corpora-List] work on corpus linguistics written in german
Date: Thu, 16 Jun 2005 15:40:30 +0200
From: "Scherer, Carmen" <cscherer_AT_uni-mainz.de>
To: <corpora_AT_uib.no>
Reply-To: corpora-archive_AT_uib.no




Dear all,

I am working on a textbook on corpus linguistics for German undergraduates. For the further reading section, I tried to locate papers and books written in German suitable for undergraduates but didn't find much (e.g. Sinclair 1998, Carstensen et al. 2001, Mukherjee 2002). I would be grateful for any further hints.

Best regards, Carmen Scherer

---

Dr. phil. Carmen Scherer
Diplom-Betriebswirtin (BA)
Johannes Gutenberg-Universität Mainz
FB 05, Deutsches Institut

D - 55099 Mainz

Tel.: +49 (0)6131-39-23365
Fax: +49 (0)6131-39-23366

Homepage: http://www.germanistik.uni-mainz.de/linguistik/mitarbeiter/scherer/scherer.php








--__--__--

Message: 4
Date: Thu, 16 Jun 2005 17:09:58 +0200
From: Alexander Horak <alexh_AT_korpus.juls.savba.sk>
To: corpora_AT_uib.no, nlp-l_AT_uci.agh.edu.pl,
sc-lista_AT_poincare.matf.bg.ac.yu, Forum_JS_AT_yahoogroups.com
Subject: [Corpora-List] 2nd CfP: Computer Treatment of Slavic and East European Languages
- Slovko 2005
Reply-To: corpora-archive_AT_uib.no

Second Call for Papers

##########################################################
Third International Seminar
Computer Treatment of Slavic and East European Languages
Slovko 2005
Ľ. Štúr Linguistics Institute (Slovak Academy of Sciences)
Faculty of Education (Comenius University)
10–12 November 2005, Bratislava, Slovakia
http://korpus.juls.savba.sk/~slovko
##########################################################

DESCRIPTION
The seminar will provide a meeting point for people working
on various aspects of the relationship between languages and
computers. With this broad thematic framework in mind, papers
are invited describing activities concerning any Slavic or
East European language, as well as those dealing with bi- and
multi-lingual projects involving at least one Slavic or Central
European language.

TOPICS
The topics may include – but are not limited to:
Tools for linguistic text analysis
Creation and use of language resources
Linguistic components of information systems
Linguistic databases
Speech analysis and synthesis
Computer-aided translation, localization and lexicography
Computer-aided language learning

KEYNOTE SPEAKER
Josef Psutka (University of Western Bohemia, Pilsen)

PROGRAM COMMITTEE
František Čermák (Charles University, Prague)
Jan Hajič (Charles University, Prague)
Vladimír Petkevič (Charles University, Prague)
Karel Pala (Masaryk University, Brno)
Ivan Kopeček (Masaryk University, Brno)
Adam Przepiórkowski (Polish Academy of Sciences, Warsaw)
Tamás Váradi (Hungarian Academy of Sciences, Budapest)
Marko Tadić (University of Zagreb, Zagreb)
Milan Rusko (Slovak Academy of Sciences, Bratislava)
Slavo Ondrejovič (Slovak Academy of Sciences, Bratislava)
Alexandra Jarošová (Slovak Academy of Sciences, Bratislava)
Peter Ďurčo (University of Ss. Cyril and Methodius, Trnava)

ORGANIZING COMMITTEE
Mária Šimková
Radovan Garabík
Vladimír Benko
Alexander Horák

SUBMISSION INFORMATION
The papers may be presented in any Slavic language as well as in
English. The language for submitting a paper is English.
If you intend to present a paper, please send an abstract in a plain
text file by June 30, 2005. The deadline for camera-ready version
of the paper is September 12, 2005.
The authors are strongly encouraged to write their papers in LaTeX
format, using the llncs2e document class. Please refer to the
instructions found on Springer website
http://www.springer.de/comp/lncs/authors.html#Proceedings
Alternatively, if you prefer to work in a WYSIWYG environment,
you can submit your article in OpenOffice 1.1 format, with standard
fonts. If you need to use any special fonts, make sure they are
TrueType fonts with a correct Unicode table, and include the font with
your article. As a last resort, authors can submit their articles
using MS Word (at least version 97), but they must use the LNCS
template for Word, and be aware that the article will be converted
into OpenOffice, with any resulting loss of formatting being the sole
responsibility of the author. A PDF or PostScript version with all
fonts embedded should be enclosed with the paper.
We expect proceedings to be published in 2006. Proceedings from
the previous seminar Slovko03 will be published prior to the beginning
of the seminar.

IMPORTANT DATES
Abstract submission deadline: June 30, 2005
Notification of acceptance: July 30, 2005
Camera-ready papers due: September 15, 2005
Slovko05 conference: November 10–12, 2005

CONFERENCE VENUE
Center for advanced studies
University of Economics
Palisády St. 22
811 06 Bratislava 1, Slovakia

ACCOMMODATION
Center for advanced studies, Palisády 22
2 apartments available á 1200 SKK (€30)

Center of university services, Konventná 1 (3 minutes from the
conference venue)
2–4 bed rooms á 500 SKK (€13)
1 apartment á 1000 SKK (€25)

CONFERENCE FEES
The seminar fee is planned to be €50 (students €20) and includes one
copy of the proceedings, refreshments, a social event and organizing costs.
The conference fee should be paid before July 30, 2005. After this date
the conference fee is 50% higher.

Bank transfer to:
Narodna banka Slovenska
Imricha Karvasa 1
813 25 Bratislava
SWIFT Code: NBSB SKBX
Account Number: SK23 8180 0000 0070 0000 6607

Note that payment for the accommodation will be made at the beginning
of the conference.


REGISTRATION
If you intend to take part in the seminar, please fill in the registration
form and send it (by e-mail, fax or snail-mail) to the address:

Alexander Horák
Department of Slovak National Corpus
Ľ. Štúr Linguistics Institute
Panská 26, 813 64 Bratislava, SLOVAKIA
phone:+421-2-54410304
fax:+421-2-54410307
<alexh_AT_korpus.juls.savba.sk>




--__--__--

Message: 5
Date: Fri, 17 Jun 2005 11:50:38 +1000 (EST)
From: Jim Breen <Jim.Breen_AT_infotech.monash.edu.au>
Subject: [Corpora-List] Japanese/English aligned corpora?
To: corpora-archive_AT_uib.no
Reply-To: corpora-archive_AT_uib.no

Michiel Kamermans <mkamerma_AT_science.uva.nl> asked:

>> I'm looking for aligned corpora of English/Japanese texts, as well as
>> formal Japanese/colloquial Japanese texts for a graduation project on a
>> more natural English-Japanese machine translation system. I hope someone
>> (or maybe multiple people?) on the list might know where such corpora
>> may be found.


You may be aware already of the collection of ~160,000
Japanese/English sentence pairs collected by the late Yasuhito Tanaka
some years ago. I am (sort of) custodian of an edited version of this
which I have linked into the WWWJDIC online dictionary. (The sentences
are linked to the dictionary at the (Japanese) word level, and the
collection can be searched online using regular expressions.)

You can read about the collection at
http://www.csse.monash.edu.au/~jwb/wwwexampinf.html and
http://www.csse.monash.edu.au/~jwb/wwwjdicinf.html#examp_tag The
file can be downloaded from that site too. It's Public Domain.

The collection is slowly being improved as typos and near-duplicates
are removed. I'm not sure I'd trust it too much to train
an MT system.

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C)



--__--__--

Message: 6
Date: Fri, 17 Jun 2005 10:34:01 +0100
To: corpora_AT_uib.no
From: George Demetriou <demetri_AT_dcs.shef.ac.uk>
Subject: [Corpora-List] Final CFP: ECCB'05 Workshop on Biomedical Ontologies and Text
Processing
Reply-To: corpora-archive_AT_uib.no

[Apologies for multiple postings]

Final CALL FOR PAPERS

ECCB'05 WORKSHOP ON
BIOMEDICAL ONTOLOGIES AND TEXT PROCESSING

28 September, 2005
Madrid, Spain

http://www.nlp.shef.ac.uk/eccb05-ont+text

The workshop is part of the 4th European Conference on Computational
Biology (ECCB) (http://www.eccb05.org)

Hosted by: Bioinformatics National Institute (INB)

**************** Submission deadline 20 June, 2005 *********************

Please note e-mail address for submission: eccb05-ont+text_AT_dcs.shef.ac.uk


WORKSHOP DESCRIPTION
====================

Biomedical literature, bio-databases and bio-ontologies all play an
important role in supporting the work of biological researchers. Much
of the biological knowledge in our community is held in electronic
form as natural language text. However, not all experimental data is
appropriate to include in such research publications, and so is
instead stored in more structured bio-databases. Bio-ontologies
provide a common conceptual framework for structuring and annotating
this data to enable it to be pooled across databases. These three
resources contain overlapping information in different forms, and the
inter-dependencies between them are complex.

Text mining of biomedical literature is one way to ensure that the
large quantity of information in text is better reflected within
ontologies and databases. It can be used, for example, to add ontology
based annotation to bio-database entries. By exposing the vocabulary
and relationships within the literature, it can also assist in the
construction, refinement and validation of the ontologies themselves.
Even when used in isolation, the meaning of concepts within an
ontology must be interpreted by humans as well as computer systems.
Natural language, therefore, plays a vital role in ontology design.

Ontologies in turn can support text mining by, for example: (i)
providing a framework for structuring terminologies and for clustering
synonyms; and (ii) defining the types of entities and relations that
text mining aims to discover during the process of analysing text.

Therefore text mining and ontologies have a lot in common and can be
mutually beneficial. However, bio-ontologies are frequently built
without explicitly taking into account the needs of the language
processing community. As a consequence language processing researchers
either ignore these valuable resources or are forced to adapt them
with difficulty. Furthermore, ontology builders are frequently
unaware of language processing tools, methodologies and applications
and how they might assist in the construction and evaluation of
ontologies.

The goal of this workshop is to bring together researchers from the
bio-ontology community with those from the biomedical text processing
community, with a view to furthering their understanding of each other's
needs and capabilities. Previous workshops in the area have tended
to focus more on either bio-ontologies or bio-text processing. While
some research has attempted to bridge this gap, the aim of the current
workshop is to focus explicitly on the relationship between
bio-ontologies and bio-text processing.

To that end we solicit papers that address any aspect of the
relationship between bio-ontologies and biomedical text processing.
Possible topics include, but are not limited to:

- Ontology-assisted information retrieval or extraction from biomedical text
collections
- Language processing techniques and principles for
building and maintaining bio-ontologies
- The relation between bio-ontologies and bio-lexicons and
more generally the relation between ontologies and natural language
- The role of isa and part-whole relations in bio-ontologies and
their relation to the lexical relations of hyponymy and meronymy
- The inclusion in biological databases of ontologically structured
information automatically or semi-automatically extracted from the
literature (aka curation)
- The evaluation of bio-ontologies through their use in language
processing applications
- The use of bio-ontologies for the creation of annotated language
resources (e.g. annotating texts with GO codes)
- The use of bio-ontologies to support co-reference resolution in
biomedical texts

While the goal of the workshop is to focus on the relationship between
bio-ontologies and biomedical text processing, excellent papers that
address one or the other of these areas to the exclusion of the other
will be considered at the discretion of the programme committee.

The workshop will include paper presentations and discussion. The
papers should describe recent and previously unpublished work and may
be preliminary in nature. The programme committee will arrange the
presentations and discussion based on the quality of submissions and
may invite other presentations as well. See
http://www.nlp.shef.ac.uk/eccb05-ont+text for further details.

Abstracts of the workshop papers will be published in the main ECCB05
conference proceedings and the papers themselves will be published in
a separate workshop proceedings. Negotiations are underway for a
journal special issue in which the best papers from the workshop will
be published.

IMPORTANT DATES
===============

Paper submission: June 20, 2005
Acceptance notification: July 15, 2005
Final papers due: July 22, 2005
Workshop: September 28, 2005

SUBMISSION INSTRUCTIONS
=======================

Position papers should be no more than 4000 words (5-8 pages). The
standard ACM conference style is recommended (see:
http://www.acm.org/sigs/pubs/proceed/template.html).

Papers must be submitted electronically in PDF or PostScript format.

Please send papers by e-mail to: eccb05-ont+text_AT_dcs.shef.ac.uk

WORKSHOP ORGANIZERS
===================

Chris Wroe (University of Manchester) cwroe_AT_cs.man.ac.uk
Rob Gaizauskas (University of Sheffield) r.gaizauskas_AT_dcs.shef.ac.uk
Christian Blaschke (Bioalma, Madrid) blaschke_AT_cnb.uam.es

PROGRAMME COMMITTEE
===================

Russ Altman (U. Stanford)
Sophia Ananiadou (NaCTeM)
A. Aronson (NLM)
Ted Briscoe (U. Cambridge)
Olivier Bodenreider (NLM)
Judith Blake (Jackson Laboratory)
Nigel Collier (Tokyo)
George Demetriou (U. Sheffield)
Carol Friedman (Columbia)
Ken Fukuda (Computational Biology Research Center, AIST, Tokyo)
Moustafa Ghanem (Imperial College)
Carole Goble (U. Manchester)
Lawrence Hunter (U. Colorado)
Udo Hahn (U. Jena)
Henk Harkema (U. Sheffield)
Lynette Hirschman (MITRE)
Ewan Klein (U. Edinburgh)
Phil Lord (U. Manchester)
Yves Lussier (Columbia University)
Adeline Nazarenko (Universite Paris-Nord, France)
Helen Parkinson (EBI)
Dietrich Rebholz-Schuhmann (EBI)
Patrick Ruch (University Hospital of Geneva)
Andrey Rzhetsky (Columbia University)
Stefan Schulz (U. Freiburg)
Robert Stevens (U. Manchester)
Jun'ichi Tsujii (U. of Tokyo)
Alan Rector (U. Manchester)
Alfonso Valencia (Centro Nacional de Biotechnologia, Madrid)
Karin Verspoor (Los Alamos)
Bonnie Webber (U. Edinburgh)










--__--__--

Message: 7
Date: Tue, 14 Jun 2005 21:36:43 +0100
From: Lou Burnard <lou.burnard_AT_oucs.ox.ac.uk>
To: sb_AT_ldc.upenn.edu
CC: CORPORA <CORPORA_AT_HD.UIB.NO>
Subject: [Corpora-List] Brown Corpus
Reply-To: corpora-archive_AT_uib.no

Well, every generation has every right to reinvent the work of their
predecessors, but such a "rationalization" seems to me to play somewhat
fast and loose with the design chosen by the original compilers of the
Brown Corpus... it was intended to comprise 500 equally-sized samples,
selected from 15 pre-defined categories. By lumping together all the
samples from the same category you will wind up with differently sized
samples -- category J ("learned") has 80 texts, c. 160,000 words, while
category R ("humor") has only 9, i.e. 18,000 words. That may not
matter, of course, for many applications, but it seems a shame to lose
that feature of the original design. Or will you preserve the original
text boundaries with some additional markup?
If so, you might like to consider that many of the original 2000-word
samples have some internal structure too: even within 2000 words, there
are samples taken from originally discontinuous sections of the
original, which most versions seem to disregard.

antiquariously,

Lou
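[For what it's worth, Steven's proposed restructuring quoted below and the sample boundaries discussed above are compatible: the 500 files can be lumped into one file per category while a marker line preserves each original 2000-word sample boundary. A rough sketch in Python -- the ca01-style file names follow the standard distribution, but the marker format and directory layout here are my own assumptions, not any distributed version:]

```python
import os
from collections import defaultdict

def merge_by_category(src_dir, dst_dir):
    # Concatenate the 500 Brown files (ca01 ... cr09) into one file per
    # category letter, writing a marker line before each sample so the
    # original 2000-word boundaries survive the merge.
    groups = defaultdict(list)
    for name in sorted(os.listdir(src_dir)):
        if len(name) == 4 and name[0] == "c":
            groups[name[1]].append(name)   # 'a' for ca01, 'b' for cb01, ...
    os.makedirs(dst_dir, exist_ok=True)
    for cat, names in groups.items():
        with open(os.path.join(dst_dir, cat), "w") as out:
            for name in names:
                out.write(f"<sample id={name}>\n")   # boundary marker
                with open(os.path.join(src_dir, name)) as src:
                    out.write(src.read().rstrip() + "\n")
    return {cat: len(names) for cat, names in groups.items()}
```

[A category file can then be split back into its 2000-word samples at the marker lines, so nothing of the original design need be lost.]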


Steven Bird wrote:

>Note that this version of the Brown Corpus contains 500 files, each
>consisting of around 200 lines of text on average. Perhaps these were
>as big as they could handle back in 1961. I think it would make matters
>simpler if the file structure was rationalized now, so that, e.g.:
>
>Brown Corpus file names
>Existing -> Proposed
>ca01 .. ca44 -> a
>cb01 .. cb26 -> b
>etc
>
>(NB this is how things are being restructured in NLTK-Lite, a new,
>streamlined version of NLTK that will be released later this month.)
>
>-Steven Bird
>
>On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
>
>>By one of those uncanny coincidences, I am planning to include an
>>XMLified version of the Brown corpus on the next edition of the BNC Baby
>>corpus sampler. The version I have is derived from the GPLd version
>>distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
>>and includes POS tagging; there is also a version which has been
>>enhanced to include WordNet semantic tagging but I am not clear as to
>>the rights in that.
>>
>>Lou Burnard
>>
>>Xiao, Zhonghua wrote:
>>
>>>The plain text version of Brown is available here:
>>>http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
>>>
>>>Richard
>>>________________________________
>>>From: owner-corpora_AT_lists.uib.no on behalf of Jörg Schuster
>>>Sent: Tue 14/06/2005 14:39
>>>To: CORPORA_AT_hd.uib.no
>>>Subject: [Corpora-List] Brown Corpus
>>>
>>>Hello,
>>>
>>>where can the Brown Corpus be downloaded or purchased?
>>>
>>>Jörg Schuster

--__--__--

Message: 8
From: "Adam Kilgarriff" <adam_AT_lexmasterclass.com>
To: "'Lou Burnard'" <lou.burnard_AT_oucs.ox.ac.uk>,
<sb_AT_ldc.upenn.edu>
Cc: "'CORPORA'" <CORPORA_AT_HD.UIB.NO>
Subject: [Corpora-List] Brown Corpus
Date: Fri, 17 Jun 2005 13:10:03 +0100
Reply-To: corpora-archive_AT_uib.no

All,

Like Lou, I think the original structure of the Brown, with *same-size*
samples, has a lot to commend it.

Where samples are all the same length, you can talk about the mean and
standard deviation of a phenomenon (eg, the frequency of "the") across the
samples and it becomes easy to use the t-test to establish whether the
phenomenon is systematically more common in one text type than another.

If all samples are different lengths, it is not easy: you can't use mean and
standard deviation (and standard tests like T-test) and fancy maths are
likely to make the findings impenetrable and unconvincing.

Many studies on LOB, Brown and relations have benefited from the fixed
sample length.

I come across all too many papers that argue, roughly "hey look, word X is N
times as common in corpus A as against corpus B, now let's investigate why"
- and I'm left wondering whether N is enough of a difference to be salient,
given the usually unexplored level of within-text-type variation.

This argument leads me to propose the *pseudodocument*, a fixed-length run
of text of a given text type, truncated at (eg) the 10,000th word. By
treating text in this way, we can use mean, SD, and T-test to work out if
the level of variation between one text type and another is significant.
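[Editor's sketch: the argument above can be made concrete with a few lines of stdlib Python. The sample values are invented for illustration; with fixed-length samples, per-sample counts of a word are directly comparable frequencies, so mean, SD, and a t statistic apply straightforwardly.]

```python
from statistics import mean, stdev

def t_statistic(xs, ys):
    """Welch's two-sample t statistic. Meaningful here precisely
    because every sample has the same fixed word count, so the
    per-sample frequencies in xs and ys are on a common scale."""
    mx, my = mean(xs), mean(ys)
    vx, vy = stdev(xs) ** 2, stdev(ys) ** 2
    return (mx - my) / ((vx / len(xs) + vy / len(ys)) ** 0.5)

def freq_per_sample(samples, word="the"):
    """Count `word` in each fixed-length sample; raw counts are
    comparable only because the samples are equally sized."""
    return [tokens.count(word) for tokens in samples]
```

If the samples had different lengths, these counts would first have to be normalized per thousand words, and the simple comparison of means no longer goes through cleanly -- which is the point being made above.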

You read it first on Corpora!

Adam


--__--__--

Message: 9
Date: Fri, 17 Jun 2005 14:28:24 +0200
From: Jean Veronis <Jean.Veronis_AT_up.univ-mrs.fr>
To: Adam Kilgarriff <adam_AT_lexmasterclass.com>
Cc: 'Lou Burnard' <lou.burnard_AT_oucs.ox.ac.uk>, sb_AT_ldc.upenn.edu,
'CORPORA' <CORPORA_AT_HD.UIB.NO>
Subject: [Corpora-List] Brown Corpus
Reply-To: corpora-archive_AT_uib.no

Hi Adam,

Although I agree on the same-size sample design, I am less convinced by
the use of the mean and standard deviation on corpora (as well as
t-score and a few others). The distributions are so strongly skewed that
these measures are probably not advisable. Without getting into anything
too complicated, the median and measures based on it, like the MAD (median
absolute deviation), and in general what's called "robust statistics",
seem preferable to me.
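[Editor's sketch: the robust alternative mentioned here is a one-liner over the stdlib; the example values are invented for illustration.]

```python
from statistics import median

def mad(xs):
    """Median absolute deviation: the median of each value's absolute
    distance from the sample median. Unlike the standard deviation,
    a single extreme sample barely moves it."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

# One wild sample leaves the MAD unaffected:
# mad([1, 2, 3, 4, 100]) -> 1, while the SD is blown up by the 100.
```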

--j
http://aixtal.blogspot.com









--__--__--

Send Corpora-archive mailing list submissions to
corpora-archive at uib.no

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.uib.no/listinfo/corpora-archive
or, via email, send a message with subject or body 'help' to
corpora-archive-request at uib.no

You can reach the person managing the list at
corpora-archive-admin at uib.no

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora-archive digest..."

--__--__--

_______________________________________________
Corpora-archive mailing list
Corpora-archive at uib.no
http://mailman.uib.no/listinfo/corpora-archive


End of Corpora-archive Digest



