LDC2009T20 - Czech Broadcast Conversation MDE Transcripts <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>
LDC2009T21 - Spanish Gigaword Second Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21>
The Linguistic Data Consortium (LDC) would like to announce the availability of three new publications.
------------------------------------------------------------------------
*New Publications*
(1) Czech Broadcast Conversation Speech <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02> was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic, and consists of 40 hours of speech from Radioforum, a talk show broadcast on Czech Radio 1. Transcripts corresponding to the audio files in this corpus are provided in Czech Broadcast Conversation MDE Transcripts (LDC2009T20) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>.
Czech Broadcast Conversation Speech consists of 72 single channel recordings of Radioforum, a live talk program broadcast by Czech Radio 1 (CRo1) <http://www.rozhlas.cz/radiozurnal/portal/> every weekday evening. Its format consists of invited guests spontaneously answering topical questions posed by one or two interviewers. The number of interviewees in a single program varies from one to three, but typically, one interviewer and two interviewees appear in the program. The material includes passages of interactive dialogue, but longer stretches of monologue-like speech comprise the majority of the collected data. Radioforum also has an interactive segment where listeners call the studio and ask their own questions. That telephony speech was not transcribed in the current release.
Individual recordings range from 27 minutes to 36 minutes each. The recordings were collected during the period from February 12, 2003 through June 26, 2003. The signal is mono, sampled at 22.05 kHz with 16-bit resolution, stored in Windows PCM waveform format. The names of the audio files refer to the broadcast date (rfYYMMDD.wav).
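As a minimal sketch, the audio properties stated above (mono, 22.05 kHz, 16-bit PCM) could be checked with Python's standard wave module. The function name check_wav_format and the returned dictionary keys are illustrative assumptions, not part of the corpus documentation.

```python
import wave


def check_wav_format(path):
    """Compare a WAV header against the format described in the announcement.

    Returns (ok, details). The function name and the keys of `details`
    are hypothetical, chosen only for this illustration.
    """
    with wave.open(path, "rb") as wav:
        details = {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bits": wav.getsampwidth() * 8,
            "duration_min": wav.getnframes() / wav.getframerate() / 60,
        }
    # Expected values as stated in the text: mono, 22.05 kHz, 16-bit.
    ok = (
        details["channels"] == 1
        and details["sample_rate_hz"] == 22050
        and details["sample_width_bits"] == 16
    )
    return ok, details
```

The duration_min value can also be compared against the stated range of 27 to 36 minutes per recording.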
(2) Czech Broadcast Conversation MDE Transcripts <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20> was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic, and consists of approximately 33 hours of transcribed speech from Radioforum, a talk show broadcast on Czech Radio 1. The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast Conversation Speech (LDC2009S02) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>.
Czech Broadcast Conversation MDE Transcripts was created to extend Metadata Extraction (MDE) research to conversational Czech. The goal of MDE is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript.
The transcripts and annotations in this corpus are stored in three different formats: TRS (Transcriber <http://trans.sourceforge.net>), QAn (Quick Annotator <http://www.mde.zcu.cz/qan.html>), and RTTM. TRS represents a standard speech transcript. QAn and RTTM also contain information about structural metadata (MDE). Character encoding in all files is ISO-8859-2.
All filenames have the form rfYYMMDD.format, where "rf" stands for Radioforum, the following six digits indicate the date of broadcast, and the extension ".format" corresponds to the data format of the particular file: ".trs", ".qan", or ".rttm".
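The naming convention can be illustrated with a short Python sketch that splits a filename into its broadcast date and format. The function name parse_radioforum_name is hypothetical, and the sample name rf030212.trs is constructed from the stated convention rather than taken from the corpus. The sketch also accepts ".wav", the extension used by the companion speech corpus, and assumes that the two-digit year falls in the 2000s, as the 2003 collection period implies.

```python
import re
from datetime import datetime

# "rf" + six digits (YYMMDD) + one of the extensions named in the announcement.
_NAME_RE = re.compile(r"rf(\d{6})\.(wav|trs|qan|rttm)")


def parse_radioforum_name(filename):
    """Return (broadcast_date, extension) for a name such as rfYYMMDD.trs.

    Hypothetical helper for illustration; raises ValueError when the name
    does not follow the rfYYMMDD.format convention.
    """
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a Radioforum filename: {filename!r}")
    # %y maps two-digit years 00-68 to 2000-2068, which covers 2003.
    broadcast_date = datetime.strptime(match.group(1), "%y%m%d").date()
    return broadcast_date, match.group(2)


if __name__ == "__main__":
    print(parse_radioforum_name("rf030212.trs"))
```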
(3) Spanish Gigaword Second Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21> is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates Spanish Gigaword First Edition (LDC2006T12) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12> and adds data collected from January 1, 2006 through December 31, 2008.
The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008
The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code ("spa"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
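A minimal Python sketch of how these codes might be used follows. The helper names split_code and source_of_doc_id are hypothetical. The announcement specifies only that a DOC "id" begins with the upper-case code, so the sketch makes no assumption about the rest of the string, and the sample id in the usage line is a made-up placeholder.

```python
# Source codes and names as listed in the announcement.
SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_code(code):
    """Split a code such as 'afp_spa' into (source, language)."""
    source, language = code.split("_")
    return source, language


def source_of_doc_id(doc_id):
    """Return the source code whose upper-case form prefixes doc_id, else None."""
    for code in SOURCES:
        if doc_id.startswith(code.upper()):
            return code
    return None


if __name__ == "__main__":
    print(split_code("afp_spa"))
    # "XIN_SPA_placeholder" is an invented id used only to exercise the prefix check.
    print(source_of_doc_id("XIN_SPA_placeholder"))
```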
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu