[Corpora-List] Corpus del Español Actual (CEA) / The Corpus of Contemporary Spanish

Carlos Subirats carlos.subirats at gmail.com
Thu Apr 26 08:44:27 CEST 2012


*Corpus del Español Actual (CEA) <http://sfn.uab.es:9080/SFN/tools/cea/spanish> / The Corpus of Contemporary Spanish <http://sfncorpora.uab.es/CQPweb/cea/>* (Powered by CQPweb)

The *Corpus del Español Actual <http://sfncorpora.uab.es/CQPweb/cea/>* (the Corpus of Contemporary Spanish) contains *540 million words*, which have been lemmatized and tagged with detailed part-of-speech information. The CEA is made up of the following texts:

- The Spanish part of the eleven-language parallel corpus Europarl: European Parliament Proceedings Parallel Corpus, v. 6 <http://www.statmt.org/europarl/> (1996-2010);

- The Spanish portion of the trilingual Wikicorpus, v. 1.0 <http://www.lsi.upc.edu/%7Enlp/wikicorpus/>, which was extracted from a snapshot of Wikipedia (2006); and

- The Spanish part of the seven-language parallel corpus MultiUN: Multilingual UN Parallel Text 2000-2009 <http://www.euromatrixplus.net/multi-un/>, a corpus made up of the resolutions of the United Nations.

The CEA was tagged using an online Spanish dictionary <http://sfn.uab.es:9080/SFN/tools/dictionary> containing 635,000 wordforms, which was automatically generated from a dictionary of 86,000 single-word lemmas (e.g., *unir*, *inmoralidad*, *allí*) and 26,000 multiword lemmas (e.g., *muerte cerebral*, *carga de profundidad*, *de armas tomar*) (Subirats 1989, 1992, 1994a, 1994b; Mogorrón 1994; Garrido 1999; Bobes 2000). Tag disambiguation was carried out with intersecting finite-state automata using lexical and syntactic information (Subirats 1998; Subirats and Ortega 2000, 2001; Ortega in progress).
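As a rough illustration of what disambiguation by intersecting finite-state automata can look like, here is a toy Python sketch. It is not the CEA's actual implementation: the two-word lexicon, the tag names (DET, PRON, NOUN, VERB), and the single contextual rule are all made up for the example. One automaton accepts every tag sequence the lexicon allows for a sentence, a second one encodes a syntactic constraint, and their intersection keeps only the readings that satisfy both.

```python
# Hypothetical toy lexicon: each wordform maps to its possible tags.
LEXICON = {
    "la": {"DET", "PRON"},
    "casa": {"NOUN", "VERB"},
}


def sentence_automaton(words):
    """Linear DFA accepting every tag sequence the lexicon allows for `words`."""
    trans = {}
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            trans[(i, tag)] = i + 1
    return {"start": 0, "accept": {len(words)}, "trans": trans}


def forbid_bigram(first, second, tags):
    """DFA rejecting any sequence in which `first` is immediately followed by `second`."""
    trans = {}
    for tag in tags:
        # State "ok": the previous tag was not `first`.
        trans[("ok", tag)] = "seen" if tag == first else "ok"
        # State "seen": the previous tag was `first`, so `second` has no transition.
        if tag != second:
            trans[("seen", tag)] = "seen" if tag == first else "ok"
    return {"start": "ok", "accept": {"ok", "seen"}, "trans": trans}


def intersect(a, b):
    """Product construction: accepts exactly the sequences that both a and b accept."""
    start = (a["start"], b["start"])
    symbols = {s for (_, s) in a["trans"]} & {s for (_, s) in b["trans"]}
    trans, seen, stack = {}, {start}, [start]
    while stack:
        qa, qb = stack.pop()
        for sym in symbols:
            if (qa, sym) in a["trans"] and (qb, sym) in b["trans"]:
                nxt = (a["trans"][(qa, sym)], b["trans"][(qb, sym)])
                trans[((qa, qb), sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    accept = {s for s in seen if s[0] in a["accept"] and s[1] in b["accept"]}
    return {"start": start, "accept": accept, "trans": trans}


def accepted_sequences(dfa, length):
    """Enumerate the accepted sequences of exactly `length` symbols."""
    results = []

    def walk(state, path):
        if len(path) == length:
            if state in dfa["accept"]:
                results.append(tuple(path))
            return
        for (q, sym), nxt in dfa["trans"].items():
            if q == state:
                walk(nxt, path + [sym])

    walk(dfa["start"], [])
    return sorted(results)


words = ["la", "casa"]
tags = {"DET", "PRON", "NOUN", "VERB"}
# Made-up constraint for illustration: a determiner is never directly followed by a verb.
readings = intersect(sentence_automaton(words), forbid_bigram("DET", "VERB", tags))
print(accepted_sequences(readings, len(words)))
```

Of the four readings the toy lexicon allows for "la casa", the intersection discards DET VERB and keeps the other three. A real system would intersect many such constraints, each removing further readings.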

*Searching the CEA:*

The query interface for the CEA is CQPweb <http://cwb.sourceforge.net/cqpweb.php>, which uses some of the components of the IMS Open Corpus Workbench (CWB) <http://cwb.sourceforge.net/>, a set of open-source tools for managing and searching large corpora, including the Corpus Query Processor (CQP). To learn more about how to use CQPweb, you can consult the IMS's brief description of the regular-expression syntax <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPSyntax.html> used by the CQP and their list of sample queries <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPExamples.html>. If you wish to define your query in terms of grammatical and inflectional categories, you can use the part-of-speech tags listed on the CEA's Corpus Tags <http://sfn.uab.es:9080/SFN/tools/cea/corpus-tags> page.
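To give a feel for the shape of CQP queries, here is a small Python sketch that assembles query strings from attribute constraints. The attribute names `word`, `lemma`, and `pos` follow common CWB conventions but are assumptions here, as is the tag pattern `V.*`. The attributes and tags actually available in the CEA should be taken from the Corpus Tags page. The helper functions `token` and `sequence` are hypothetical and are not part of CQP or CQPweb.

```python
def token(**constraints):
    """Build one CQP token expression, e.g. [lemma="unir" & pos="V.*"].

    With no constraints this returns "[]", which in CQP matches any token.
    """
    parts = []
    for attr, pattern in constraints.items():
        if '"' in pattern:
            raise ValueError("double quotes are not handled in this sketch")
        parts.append(f'{attr}="{pattern}"')
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions into a query matching them in consecutive positions."""
    return " ".join(tokens)


# A multiword expression: any form of the lemma "muerte" followed by "cerebral".
query = sequence(token(lemma="muerte"), token(word="cerebral"))
print(query)

# Any verb form of "unir" (the tag pattern is an assumption, not a CEA tag).
print(token(lemma="unir", pos="V.*"))
```

The resulting strings can be pasted into the CQPweb query box once the attribute names and tag patterns have been adjusted to those the corpus actually uses.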


