[Corpora-List] ParaCrawl Corpus Release v2.0

TAUS Data mail at taus.net
Mon Oct 29 12:55:51 CET 2018

Here are all the details and links to download the corpus!

** ParaCrawl corpus release v2.0
------------------------------------------------------------

The second version of the ParaCrawl corpus has been released! It contains parallel corpora for 17 languages paired with English. Six new languages are added in the v2 release: Irish, Croatian, Maltese, Lithuanian, Hungarian and Estonian. For the previously released languages (German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian and Finnish), more data has been added. For each language, two versions of the corpus are released, each produced with a different cleaning tool: BiCleaner (https://github.com/bitextor/bicleaner) and Zipporah (https://github.com/hainan-xv/zipporah).

The ParaCrawl corpus is crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl that has much higher coverage of the selected websites than CommonCrawl.

FIND OUT CORPUS SIZE AND DOWNLOAD (http://paracrawl.eu/releases.html)

The source code of the ParaCrawl open-source pipeline (Bitextor) is also available on GitHub (https://github.com/bitextor/bitextor/releases/tag/v6.0.0-rc.1).

* The corpus and software are released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility (CEF).
* This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).
* Updated releases are scheduled for October 2018 and March 2019.
* The corpora are released under the Creative Commons CC0 license (https://creativecommons.org/share-your-work/public-domain/cc0/) ("no rights reserved").

Search, Share and Leverage Data

The industry-shared Data Cloud is a repository of billions of words in multiple language pairs, across 17 industry domains and 9 content types.

DISCOVER TAUS DATA CLOUD (https://www.taus.net/data-cloud-lp)

To receive all data-related updates and keep up to date with the latest industry trends, register for the TAUS Newsletter (https://taus.us8.list-manage.com/subscribe?u=05d438ec905cfc9f1daa88a72&id=414a95d912).

Copyright 2018 TAUS BV, All rights reserved.
