[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 24 23:21:54 CET 2015


New publications:

Avocado Research Email Collection
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
RATS Speech Activity Detection

------------------------------------------------------------------------

New publications

(1) Avocado Research Email Collection <https://catalog.ldc.upenn.edu/LDC2015T03> consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada".

The collection consists of the processed personal folders of these accounts, with metadata describing, among other things, folder structure, email characteristics, and contacts. It is expected to be useful for social network analysis, e-discovery, and related fields.

The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian", although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.
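The layout above suggests a straightforward traversal: parse the top-level custodian list, then each custodian's item listing, and pull the matching extracted text out of that custodian's zip file. The Python sketch below is only an illustration of that idea; the directory layout, file names, and XML element/attribute names (custodians.xml, <custodian>, <item>, "id", "textfile") are assumptions, not the documented schema, and it opens each custodian's XML file directly rather than relying on a directive-handling parser.

    import xml.etree.ElementTree as ET
    import zipfile
    from pathlib import Path

    root = Path("avocado")  # hypothetical corpus root

    # Top-level XML file listing the custodians (file name assumed)
    custodians = ET.parse(root / "metadata" / "custodians.xml").getroot()

    for cust in custodians.iter("custodian"):       # element name assumed
        cust_id = cust.get("id")

        # One XML file per custodian listing every item extracted from its PST files
        items = ET.parse(root / "metadata" / f"{cust_id}.xml").getroot()

        # The extracted text of each item sits in a separate file inside the
        # custodian's zip archive
        with zipfile.ZipFile(root / "text" / f"{cust_id}.zip") as zf:
            for item in items.iter("item"):         # element name assumed
                text = zf.read(item.get("textfile")).decode("utf-8")
                print(cust_id, item.get("id"), len(text.split()))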


(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 <https://catalog.ldc.upenn.edu/LDC2015T04> was developed by LDC and contains 242,020 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files     Words   CharTokens   Segments
Chinese    BC         92    67,354      101,032      2,714
Chinese    BN         34    93,992      140,988      3,314
Total                126   161,346      242,020      6,028

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.

The Chinese word alignment tasks consisted of the following components:

* Identifying, aligning, and tagging eight different types of links

* Identifying, attaching, and tagging local-level unmatched words

* Identifying and tagging sentence/discourse-level unmatched words

* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
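Purely to illustrate what the alignment and tagging annotation described above expresses, here is one possible in-memory representation in Python. The class, its fields, and the tag values ("semantic", "DE-modifier") are invented for the sketch and are not the GALE annotation format.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class AlignmentLink:
        """One word-alignment link between Chinese and English token spans."""
        src_tokens: List[int]   # indices into the Chinese sentence
        tgt_tokens: List[int]   # indices into the English sentence (empty if unmatched)
        link_tag: str           # alignment link tag describing the translation relation
        word_tags: Dict[int, str] = field(default_factory=dict)  # per-token word tags

    # Made-up example: Chinese tokens 3-4 align to English token 2, and token 4
    # (a 的/DE instance) carries a word tag.
    link = AlignmentLink(
        src_tokens=[3, 4],
        tgt_tokens=[2],
        link_tag="semantic",
        word_tags={4: "DE-modifier"},
    )
    print(link)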


(3) RATS Speech Activity Detection <https://catalog.ldc.upenn.edu/LDC2015S02> was developed by LDC and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations included three frequencies -- high, very high and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto and Urdu speakers; and (2) material from the Fisher English (LDC2004S13 <https://catalog.ldc.upenn.edu/LDC2004S13>, LDC2005S13 <https://catalog.ldc.upenn.edu/LDC2005S13>), and Fisher Levantine Arabic telephone studies (LDC2007S02 <https://catalog.ldc.upenn.edu/LDC2007S02>), as well as from CALLFRIEND Farsi (LDC2014S01 <https://catalog.ldc.upenn.edu/LDC2014S01>).

Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second pass annotation step, annotators reviewed first pass output and made adjustments to segments as needed.
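The kind of adjustment made in those manual passes can be pictured with a toy post-processing step over automatically produced segments. The segment representation, the "S"/"NS" labels, and the 0.3-second merge threshold below are assumptions made for illustration; they are not LDC's annotation tooling or guidelines.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Segment:
        start: float   # seconds
        end: float     # seconds
        label: str     # "S" = speech, "NS" = non-speech (labels assumed)

    def merge_close_speech(segments: List[Segment], gap: float = 0.3) -> List[Segment]:
        """Merge consecutive speech segments separated by less than `gap` seconds."""
        merged: List[Segment] = []
        for seg in segments:
            if (merged and seg.label == "S" and merged[-1].label == "S"
                    and seg.start - merged[-1].end < gap):
                merged[-1].end = seg.end
            else:
                merged.append(Segment(seg.start, seg.end, seg.label))
        return merged

    auto = [Segment(0.0, 1.2, "S"), Segment(1.3, 2.5, "S"), Segment(4.0, 5.0, "S")]
    print(merge_close_speech(auto))   # the first two segments collapse into one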

All audio files are presented as single-channel, 16-bit PCM at 16,000 samples per second. All files use lossless FLAC compression; when uncompressed, they have typical "MS-WAV" (RIFF) file headers.
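Verifying that format on a decoded file is simple with a general-purpose audio library. The sketch below assumes the third-party Python package soundfile and a hypothetical file path; it is not part of the corpus distribution.

    import soundfile as sf   # third-party: pip install soundfile

    path = "rats_sad/audio/example.flac"   # hypothetical file path

    info = sf.info(path)
    assert info.samplerate == 16000        # 16,000 samples per second
    assert info.channels == 1              # single channel
    assert info.subtype == "PCM_16"        # 16-bit PCM inside the FLAC container

    data, sr = sf.read(path)               # decodes the FLAC file to an array of samples
    print(f"{len(data) / sr:.1f} seconds of audio")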

------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                 Phone: 1 (215) 573-1275
University of Pennsylvania                 Fax: 1 (215) 573-2175
3600 Market St., Suite 810                 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu



