We are happy to announce the release of a new version of the OSCAR corpus. For those who do not know our corpus yet, the OSCAR project is an open-source initiative that aims to distribute multiple multilingual textual corpora based on CommonCrawl for NLP and machine learning applications.
The latest corpus, OSCAR 22.01, is a new and improved, document-oriented multilingual web corpus based on the CommonCrawl dump from November/December 2021. The total corpus size is around 8TB, and it contains data in more than 152 languages plus a new multilingual subcorpus. It also contains document-level annotations, enabling simple (but limited, for now) document filtering based on some quality cues.
The corpus is available to researchers, who can request access by emailing us at oscar-corpus at inria.fr. It is also now freely available to everyone on the Hugging Face datasets hub: https://huggingface.co/datasets/oscar-corpus/OSCAR-2201. You can use the corpus either by downloading it from the Hugging Face servers or by integrating it into your code using the datasets library in streaming mode.
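For readers who want to try the streaming route, here is a minimal Python sketch. It is an illustration under assumptions, not an official recipe: the function names are ours, the `language` argument and the `token=True` login requirement are assumptions about how the Hub repository is configured, and the record layout (a `meta` field holding an `annotations` list) is an assumption about the document format. Depending on your version of the datasets library, the exact arguments may differ, so please check the dataset page for the current instructions.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def stream_oscar(language: str = "fr"):
    """Open one OSCAR 22.01 language subcorpus in streaming mode (needs network access)."""
    # Imported lazily so the filtering helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar-corpus/OSCAR-2201",
        language=language,  # assumption: subcorpora are selected by language code
        split="train",
        streaming=True,     # iterate lazily instead of downloading the whole subcorpus
        token=True,         # assumption: access requires a logged-in Hugging Face account
    )


def without_annotations(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only documents that carry no document-level annotation.

    Assumption: each record looks like
    {"text": ..., "meta": {"annotations": [...] or None}}.
    """
    for doc in docs:
        meta = doc.get("meta") or {}
        if not meta.get("annotations"):
            yield doc


def main() -> None:
    # Print the beginning of the first three documents without annotations.
    for doc in islice(without_annotations(stream_oscar("fr")), 3):
        print(doc["text"][:200])
```

Calling `main()` streams the French subcorpus and prints the start of three documents that carry no annotation. The filter is kept separate from the loader so that it can be tried on any iterable of records, including locally downloaded files.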
We are also happy to announce that we are conducting a wide survey to better understand the OSCAR user base and to tailor future development and research to their use cases. The survey should take around 10 minutes to complete, and should give us useful information about your use cases and your reasons for using (or not using) OSCAR. It contains questions about OSCAR's availability, ease of use, tooling and features. You can access the survey at the following link: https://forms.gle/BGEcaYkH2qWb3zrR7
We look forward to your remarks, questions and suggestions, whether in the survey, by email or via our Discord.
Best,
The OSCAR Team
https://oscar-corpus.com