[Corpora-List] ParaCrawl Corpus Release v5.0

TAUS Data mail at taus.net
Tue Sep 17 14:50:39 CEST 2019

The fifth version of the ParaCrawl corpus has been released.

https://paracrawl.eu/news.html

ParaCrawl corpus release v5.0
------------------------------------------------------------

The fifth version of the ParaCrawl corpus has been released. It is the first release under the ParaCrawl action "Broader Web-Scale Provision of Parallel Corpora for European Languages". The latest release contains newly crawled data, including data from the Internet Archive. Enhancements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size of release v4 for all the official EU languages (23 languages paired with English).

Corpora sizes and download links: https://paracrawl.eu/releases.html

[Chart: overview of the corpora sizes in terms of English word counts]

* We have also published some quality assessment results for ParaCrawl v5 with Europarl v7. See the full results: https://paracrawl.eu/releases.html#quality-assessment
* The latest release of the ParaCrawl open-source pipeline (Bitextor) is available on GitHub (https://github.com/bitextor/bitextor).
* The ParaCrawl efforts will continue with the Broader Web-Scale Provision of Parallel Corpora for European Languages, focusing on more language pairs, ingesting more file formats beyond HTML, expanding the crawl coverage, and applying domain filtering.
* The corpus and software are released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).
* The corpora are released under the Creative Commons CC0 license (https://creativecommons.org/share-your-work/public-domain/cc0/) ("no rights reserved").

Get the latest news regarding the ParaCrawl project on Twitter

ParaCrawl now has an official Twitter account. Make sure to follow it: @ParaCrawl (https://twitter.com/ParaCrawl)

To receive all data-related updates and keep up to date with the latest industry trends, register for the TAUS Newsletter (https://taus.us8.list-manage.com/subscribe?u=05d438ec905cfc9f1daa88a72&id=414a95d912).

Copyright 2019 TAUS BV. All rights reserved.
