[Corpora-List] Legal-domain corpora

Jernej Vicic jernej.vicic at pef.upr.si
Wed Oct 18 17:47:02 CEST 2006


You can try JRC-Acquis:

JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available

SIZE AND FORMAT

- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
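
Since the files are UTF-8 XML, they can be read with any standard XML parser. Below is a minimal sketch in Python. The tiny sample document is made up for illustration, and the element names (a TEI.2 root with numbered p paragraphs) are an assumption based on common TEI P4 usage, so check them against the files you actually download.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real corpus files are much larger and
# their exact element layout should be verified against the download.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="doc-en" lang="en">
  <text>
    <body>
      <p n="1">COUNCIL REGULATION</p>
      <p n="2">This Regulation shall enter into force on the day of its publication.</p>
    </body>
  </text>
</TEI.2>"""


def paragraphs(xml_bytes):
    """Return (paragraph number, text) tuples for every <p> element."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


def word_count(paras):
    """Count whitespace-separated tokens across all paragraphs."""
    return sum(len(text.split()) for _, text in paras)


paras = paragraphs(SAMPLE.encode("utf-8"))
for number, text in paras:
    print(number, text)
print("words:", word_count(paras))
```

Whitespace tokenisation is only a rough proxy for the word counts quoted above; the corpus documentation may count words differently.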

LANGUAGES

Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.

TEXT TYPES

- Documents on the contents, principles and political objectives of the
  EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

PARAGRAPH ALIGNMENT

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
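
The figure of 210 language pairs follows directly from the 21 languages: it is the number of unordered pairs, 21 * 20 / 2. A small Python sketch that enumerates them, and multiplies by the approximate 270,000 alignments per pair to get a rough total (an estimate derived here, not a figure from the announcement):

```python
from itertools import combinations

# The 21 corpus languages as listed in the announcement.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Unordered pairs: (Czech, Danish) and (Danish, Czech) count once.
pairs = list(combinations(LANGUAGES, 2))

# Rough total, assuming about 270,000 alignments for every pair.
ALIGNMENTS_PER_PAIR = 270_000
total_alignments = len(pairs) * ALIGNMENTS_PER_PAIR

print(len(LANGUAGES), "languages,", len(pairs), "language pairs")
print("approximate total alignments:", total_alignments)
```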

MANUAL SUBJECT DOMAIN CLASSIFICATION

- Manually classified according to EUROVOC subject domains
- Classes drawn from a wide-coverage set of 6000 hierarchically
  organised classes.

USE / DOWNLOAD

- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.



Seth Grimes wrote:


>Hello all,
>
> I'm researching legal-domain application of NLP with machine
>learning. What annotated corpora are available in this domain, either for
>free or for a license fee? I'd be interested in --
>
>- legislation and statutes
>- case law
>- briefs, depositions & testimony, crime reports, and evidentiary
>materials
>- court judgments
>- patent filings
>
>-- and also in parallel, multi-lingual corpora, for instance that might
>have been created in the EU, Switzerland, Canada, and other areas with
>multiple official languages.
>
> I've been told that news-media text can provide good training
>material for the legal domain. I'd also be interested in hearing
>reactions to that claim, especially if anyone has formally studied the
>question.
>
> Thanks very much for all help,
>
> Seth
>
>--
>Seth Grimes Alta Plana Corp, analytical computing & data management
> Intelligent Enterprise magazine (CMP), Contributing Editor
>grimes at altaplana.com http://altaplana.com 301-270-0795





