[Corpora-List] 1st Announcement: CL2007 Colloquium on Corpora and Genre

santinim at inwind.it
Wed Mar 21 14:53:00 CET 2007

Apologies for multiple postings


COLLOQUIUM: "Towards a Reference Corpus of Web Genres"

Colloquium held in conjunction with the Fourth Corpus Linguistics conference at the University of Birmingham, 27-30 July 2007.

***The exact date and time schedule of the Colloquium will be announced later on***

Organizers: Marina Santini and Serge Sharoff

Workshop website: http://corpus.leeds.ac.uk/serge/webgenres/

Corpus Linguistics 2007 website: http://www.corpus.bham.ac.uk/conference2007


Genres of spoken and written texts are being intensively studied from various angles, e.g., communication studies, discourse analysis, computational linguistics, without arriving at a generally accepted definition. Many corpora have been built to represent the language, but very few large corpora indicate genres, and when they do the typology of genres varies widely. For instance, the Brown corpus famously uses 15 textual categories, from press reportage (a text genre) to religion or skills and hobbies (domains), while the British National Corpus (BNC) uses 70 classes, such as academic or non-academic scientific texts or biography. Interestingly, genre classes in the BNC are an add-on proposed by David Lee (Lee, 2001) after the corpus construction, rather than a basic criterion of the corpus creation. The genre attribute was included in a few collections used in information retrieval (TREC HARD 2003 and 2004, or TREC-2006 Blog Track), but the set of genres proposed was either debatable (e.g. the ‘reaction’ genre in TREC HARD 2003), or limited to a single genre (e.g. the blog genre in TREC-2006 Blog Track).

The web is new, so it is even less clear how to apply traditional notions of genre to web documents. In corpus-based genre studies, the main tendency has been to build one's own genre collection according to subjective criteria for corpus composition, genre annotation, and genre granularity. Genre annotation has been based either on the common sense of a single rater or on the agreement of a few annotators. In brief, as things stand, web genre analyses remain self-contained and corpus-dependent.

Building a reference corpus of web genres is certainly difficult because web documents are often characterized by a high level of genre hybridism, by a fragmentation of textuality across several documents, and by the impact of technical features such as hyperlinking, posting facilities and multi-authoring. Since the web is a huge reservoir of documents that can be easily mined for building all sorts of corpora, it is important to overcome the subjectivity that characterizes genre-related issues, in order to create sharable resources. What should we consider when designing a reference corpus of web genres? Genres of web documents show some traits that are important on the web but are not accounted for in TREC collections or in the BNC. For example:

* Genre Hybridism and Individualization
The fluidity and fast-paced dynamism of the web together with the complexity of web pages cause unclear genre conventions, and favour genre mixture and authorial creativity. These two phenomena appear to be very common on the web.

* Granularity of the Unit of Analysis
How many granularities of the unit of analysis should be included? Only genres representing web sites? Only genres representing web pages? Both?

* Format of Web Documents
An issue related to the previous one is the 'format' that should be used to store the 'units of analysis' in a collection. In what form can a web page or a website be included in a corpus? In a database-like form, as DOM trees, as a net of graphs, in HTML format, or simply in a text-only version? Including images or leaving them out? Removing boilerplate or keeping it?

* Genre Granularity and Similarity
Genres can be accounted for at subgenre, genre and super-genre level: what level of genre granularity should be applied in the reference corpus? Furthermore, should similar genres, such as TUTORIAL and HOW-TO, be accounted for separately?

* How to build a Genre Palette
How many and which genres should be included in a genre reference corpus?

* Validation and Evaluation of a Reference Corpus of Web Genres
How can we validate and evaluate the quality of a genre corpus?

The rationale for this colloquium is to draw up an initial list of characteristics and requirements for building, annotating and evaluating reference corpora of web genres.

Four longer presentations prepared for the colloquium report empirical results and offer hands-on answers to some of these questions. More precisely, Alexander Mehler analyses web genres at website level and suggests a database-like form of storage. He offers an interesting angle on the notion of web genres using structural and linking information. Barbara H. Kwasnik, Kevin Crowston, Joseph Rubleske and You-Lee Chun tell us how they built a corpus of genre-tagged web pages to populate their genre collection. Serge Sharoff focuses on the similarities between web-derived corpora and classical corpora constructed from print media. Finally, Mark Rosso describes his experience in assembling a genre palette that could be useful for building a genre reference corpus to help web searches.

Shorter presentations describe settings of ongoing or future research, and provide preliminary answers to some of the problems listed above. More precisely, Andrea Stubbe and Christoph Ringlstetter discuss two important aspects of web genre research: granularity of genre hierarchies and multi-genre classification. Rosario Caballero and Noelia Ruiz-Madrid focus on HOW-TO TEXTS and address the issue of similar genres. Andrea Stubbe, Christoph Ringlstetter, Tong Zheng, and Randy Goebe present an intriguing idea: a genre classifier that adapts to the information need of a specific user on the basis of user events. They report on how to assemble a genre-annotated corpus. Julia Almeida points out the importance of pictorial information, which currently plays a minor role in genre analysis and corpus building and which might deserve more attention when dealing with web documents. Finally, Cornelius Puschmann proposes an XML-based storage schema for the compilation of computer-mediated discourse (CMD) corpora from mixed sources.

Building a genre-annotated reference corpus of web pages is arduous for a number of reasons, and several solutions appear to be viable. In this colloquium, we would like to make a first attempt to apply the concept of genre to the development of sharable criteria for building genre corpora.

The ambition of this colloquium, the first ever organized on this topic, is to bring together researchers from different communities, such as corpus linguistics, genre analysis, the digital genre community, computational linguistics, and information retrieval, in order to promote the discussion and development of new ideas and methods for creating new corpora, both for language studies and as evaluation resources.



* Alexander Mehler: A Corpus Model of Structure Formation in Hypertext Types
This paper describes a web genre corpus model. Its starting point is a graph model of the logical document structure of hypertext types and of the linkage of their constituents. We describe an XML-based serialization of this model and provide a database mapping which retains a wide range of web genre data. This will be exemplified by three web genres.

* Barbara H. Kwasnik, Kevin Crowston, Joseph Rubleske and You-Lee Chun: Building a Corpus of Genre-Tagged Webpages for an Information-Access Experiment
This presentation reports on one phase of a larger study whose overarching aim is to determine how providing genre metadata can help in access to sources of information in a digital environment. We have built a corpus of genre-tagged web pages and structured this particular experimental corpus in such a way as to provide the maximum control for our experiments. We recognize, however, that much rich genre information was either too difficult to represent or had to be pared away.

* Serge Sharoff: In the garden and in the jungle: comparing genres in the BNC and Internet
According to Adam Kilgarriff, the BNC is a jungle when compared to smaller Brown-type corpora, but it looks more like an English garden when compared to the Internet. In this presentation I will compare English and Russian Internet corpora against their human-collected counterparts (BNC and RNC) using two methods: the first involves manual annotation of a subset of Internet corpora, and the second uses probabilistic classifiers. The study shows that the Internet is not radically different from the BNC: Internet corpora do contain a wide range of genres and approximate many genres that exist in their printed form. The same is true for the audience level (texts for professionals or for laypeople).

* Mark Rosso: Development of a Genre Palette
This presentation details the development of a genre palette used in the study of the effects of genre-annotated search results on the relevance judgement process in a web search environment. This palette development was conducted in several phases: (i) a survey of user terminology; (ii) user-based refinement of terminology into a tentative genre palette, and (iii) user validation of the genre palette.


* Andrea Stubbe and Christoph Ringlstetter: Recognizing Genres
We introduce a two-level hierarchy of genres based on the definition of genre in terms of form and function (or purpose). Thereby we provide sufficient granularity with the possibility to return to a coarser scheme when preferable. As some texts may naturally fall into more than one genre, an assignment to multiple classes is possible. For those applications where a unique class is required, several techniques for the combination of classifiers were evaluated.

* Rosario Caballero and Noelia Ruiz-Madrid: The impact of technology on how-to texts: Issues and prospects
This paper explores online how-to texts produced by private and public entities. Together with analysing the link system of the texts (using C-map) we discuss (a) whether authorial differences have an impact on the texts' architecture, (b) the way(s) users search for the texts on the web and their representation of the genre, and (c) the heading used to store such a corpus.

* Andrea Stubbe, Christoph Ringlstetter, Tong Zheng, and Randy Goebe: Incremental genre classification
In this presentation we will describe attempts to acquire data. These attempts have to consider the users explicitly and cooperatively. The user behaviour will be simulated using annotated corpus data. We will also formulate different scenarios for information gain representing different levels of uncertainty. Our goal is to integrate existing material of different sources into a realistic application.

* Julia Almeida: A Text and image in web context
We will propose new connections between linguistics and semiotics in order to redefine the relations between image and text. We intend to construct an approach to elucidate peculiarities of a text corpus from the web (including several genres) and explore the notion of textuality in the web context.

* Cornelius Puschmann: SchemaCMD: An XML-based storage schema for the compilation of mixed-source CMD corpora
This presentation will outline an XML schema for the segmentation and storage of data from Internet sources, specifically those which utilize so-called web feeds (often associated with the RSS protocol). It is based on the faceted classification scheme recently proposed by Susan Herring and aims to make data from diverse sources accessible and comparable in a single format.

Information on registration and registration fees is provided at the CL2007 website: http://www.corpus.bham.ac.uk/conference2007

Corpus Linguistics 2007 Conference Dates: 27-30 July 2007
Corpus Linguistics 2007 Venue: University of Birmingham, Birmingham, UK

Marco Baroni (University of Trento, Italy)
Stefan Gries (University of California, USA)
Adam Kilgarriff (Lexmasterclass, UK)
Alexander Mehler (Bielefeld University, Germany)
Sven Meyer zu Eissen (University of Weimar, Germany)
Paul Rayson (UCREL, Lancaster University, UK)
Georg Rehm (University of Tuebingen, Germany)
Marina Santini (University of Brighton, UK)
Serge Sharoff (University of Leeds, UK)
Benno Stein (University of Weimar, Germany)

Marina Santini (University of Brighton, UK)
Email: MarinaSantini.MS at gmail.com
Personal Home Page: http://www.nltg.brighton.ac.uk/home/Marina.Santini/

Serge Sharoff (University of Leeds, UK)
Email: s.sharoff at leeds.ac.uk
Personal Home Page: http://corpus.leeds.ac.uk/serge/

For questions or comments, please contact Marina Santini (MarinaSantini.MS at gmail.com), or Serge Sharoff (s.sharoff at leeds.ac.uk).
