[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Sep 23 20:12:31 CEST 2015


*New publications:*

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4

GALE Phase 3 and 4 Arabic Newswire Parallel Text

NewSoMe Corpus of Opinion in News Reports

------------------------------------------------------------------------

(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 <https://catalog.ldc.upenn.edu/LDC2015T18> was developed by LDC and contains 243,038 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files     Words   CharTokens   Segments
Chinese    BC         69    67,782      101,674      2,276
Chinese    BN         29    94,242      141,364      3,152
Total                 98   162,024      243,038      5,428

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
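
As a rough check of that conversion, the following Python sketch derives the word counts from the character-token counts in the table. The function name is hypothetical, and rounding down is an assumption: the announcement gives only the 1.5 ratio, not the rounding rule.

```python
# Character-token counts per genre, taken from the table in this announcement.
CHAR_TOKENS = {"BC": 101_674, "BN": 141_364}


def estimate_words(char_tokens: int) -> int:
    """Estimate words from characters at 1.5 characters per word.

    Dividing by 1.5 is the same as multiplying by 2/3; integer
    arithmetic avoids floating-point surprises. Rounding down is an
    assumption, not something the announcement states.
    """
    return char_tokens * 2 // 3


words = {genre: estimate_words(n) for genre, n in CHAR_TOKENS.items()}
total_chars = sum(CHAR_TOKENS.values())
total_words = sum(words.values())

print(words)        # per-genre word estimates
print(total_chars)  # total character tokens
print(total_words)  # total words, summed per genre
```

Under that rounding assumption the per-genre estimates match the Words column. Applying the ratio to the 243,038 total directly gives 162,025, one more than the table's 162,024, which suggests the total was summed from the per-genre figures.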

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link


(2) GALE Phase 3 and 4 Arabic Newswire Parallel Text <https://catalog.ldc.upenn.edu/LDC2015T19> was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 551 source-translation document pairs, comprising 156,775 tokens of Arabic source text and its English translation. Data is drawn from seven distinct Arabic newswire sources: Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations. Source data and translations are distributed in TDF format.


(3) NewSoMe Corpus of Opinion in News Reports <https://catalog.ldc.upenn.edu/LDC2015T17> was compiled at Barcelona Media <http://www.barcelonamedia.org/> and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower <http://www.crowdflower.com/> with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.

------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: 1 (215) 573-1275
University of Pennsylvania          Fax: 1 (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
