[Corpora-List] sequences of @ in the coca corpus

Mark Davies Mark_Davies at byu.edu
Thu Mar 19 14:42:50 CET 2020


Nicolas,


>> One of my students asked me why there are sequences of @ in the body of news texts in the COCA corpus.

I think you might be referring to the downloadable, offline version of COCA at https://www.corpusdata.org/. If so, please see: https://www.corpusdata.org/limitations.asp.

The online corpus (https://www.english-corpora.org/coca/) wouldn't have those; all one billion words of data are there.
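For anyone who wants to check how much of a downloaded text is affected, here is a minimal sketch in Python. It assumes the masked stretches show up as runs of whitespace-separated @ tokens; that is an assumption about the file format, not something spelled out in this thread, and the names `AT_RUN`, `find_at_runs` and the sample sentence are made up for illustration. The limitations page linked above remains the authoritative description.

```python
import re

# Assumption: masked words appear as standalone "@" tokens separated by
# whitespace. The lookarounds keep an "@" inside an e-mail address or a
# handle from being counted as part of a run.
AT_RUN = re.compile(r"(?<!\S)@(?:\s+@)+(?!\S)")


def find_at_runs(text):
    """Return (character_offset, token_count) for each run of two or more '@' tokens."""
    return [(m.start(), len(m.group().split())) for m in AT_RUN.finditer(text)]


# Hypothetical sample line, not taken from the corpus.
sample = "the mayor said @ @ @ @ @ @ @ @ @ @ after the vote , write to joe@example.com"

for offset, count in find_at_runs(sample):
    print(f"run of {count} '@' tokens at character offset {offset}")
```

Summing the token counts over a file and dividing by the total number of tokens would give a rough estimate of the share of masked text.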

MD

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: Nicolas TURENNE <nicolas.turenne at univ-eiffel.fr>
Sent: Thursday, March 19, 2020 3:21 AM
To: Mark Davies
Cc: Corpora list
Subject: sequences of @ in the coca corpus

Hello Mark,

One of my students asked me why there are sequences of @ in the body of news texts in the COCA corpus.

Thank you.

Best,
Nicolas

--
Nicolas Turenne
Data Science Group, Division of Science and Technology
UIC United International College
Beijing Normal University-Hong Kong Baptist University
Zhuhai Campus, China

----- Original message -----
From: Mark Davies <Mark_Davies at byu.edu>
To: Corpora list <corpora at uib.no>
Sent: Thu, 17 May 2018 15:19:20 +0200 (CEST)
Subject: [Corpora-List] The new 14 billion word iWeb corpus (from the BYU corpora)

We have just released the new 14 billion word iWeb corpus (https://corpus.byu.edu/iweb/), which complements other BYU corpora (https://corpus.byu.edu/) such as COCA, COHA, NOW, BYU-BNC, GloWbE, Wikipedia, and EEBO.

At 14 billion words, iWeb is more than 25 times as large as the 560 million word COCA corpus. iWeb also has a much wider range of web-based materials than does COCA, since it is based on 22 million web pages in nearly 100,000 carefully selected websites (based on Alexa.com, https://www.alexa.com/topsites, from Amazon).

New in iWeb is the ability to browse through the top 60,000 words in the corpus, and to search this list by word form, part of speech, rank (#1-60,000), and even pronunciation.

Most importantly, you can then see detailed information on each of the top 60,000 words in the corpus: definition, frequency information, synonyms and other related words (from WordNet, word families, MRC, etc.), collocates (in a much improved format), related “topics” (perhaps much more useful than collocates), “clusters” (new in iWeb), relevant websites, and sample concordance/KWIC lines. Extensive hyperlinks allow you to move quickly and easily from one word to a number of related words.

In addition, for each of these 60,000 words, there are “quick links” to related data from other websites – pronunciation, additional definitions, images, videos, and translations (for more than 100 languages).

Finally, in terms of “standard” corpus searches, we note that (due to improvements in the corpus architecture) iWeb is faster than any of the other BYU corpora, and it is typically much faster than other large, 10-20 billion word online corpora. iWeb also allows you to quickly and easily create “virtual corpora” on nearly any topic, and these virtual corpora can then be searched as their own “stand-alone” corpora, or compared to other virtual corpora that you have created.

For a short overview of the corpus (in graphical format, with an emphasis on the new features), please see:

https://corpus.byu.edu/iweb/help/iweb_overview.pdf

We hope that this new corpus is useful to you in your teaching, learning, and research.

Best,

Mark Davies

BYU Corpora



