[Corpora-List] The genre of the Web

Mark Davies Mark_Davies at byu.edu
Sun Sep 18 18:52:00 CEST 2005

I'm looking for publications or URLs that look at the genre of the web in quantitative terms.

In other words, if one looks at the four major genres/registers SPOKEN, FICTION, NEWSPAPER, ACADEMIC, most would probably agree that the web is more like NEWSPAPER and ACADEMIC than it is like SPOKEN or FICTION, although there are certainly bits and pieces of all of these genres/registers on the web.

I imagine that something like the following has already been done, but it would seem that a person could look at the frequency of 50-60 words or phrases in the major genres/registers of the BNC, for example, and then compare this to the frequency of the same words and phrases on the Web. In quantitative terms, the web would be "most like" the register with the highest correlation coefficient.

Three notes:
1) A BNC-based site like VIEW [http://view.byu.edu] allows users to quickly compare the frequency in different registers [use "Charts" on the VIEW site].
2) This assumes we can abstract away from the basic methodological problem of calculating frequencies from the web -- an issue that has been discussed in a number of threads here on CORPORA.
3) This is a very simplistic lexically-oriented comparison, with no attempt to look at syntactic features, etc.
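
To make the idea concrete, the comparison could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the word list and every frequency figure below are invented placeholders (not BNC or web counts), and `pearson` and `closest_register` are hypothetical helper names. Real figures would have to come from the BNC registers and from some web frequency estimate.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def closest_register(web_freqs, register_freqs):
    """Return (best_register, scores) for the same words in each register.

    web_freqs:      {word: frequency on the web}
    register_freqs: {register_name: {word: frequency in that register}}
    """
    words = sorted(web_freqs)  # fixed word order for every vector
    web_vector = [web_freqs[w] for w in words]
    scores = {
        name: pearson(web_vector, [freqs[w] for w in words])
        for name, freqs in register_freqs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# PLACEHOLDER DATA: invented per-million frequencies, for illustration only.
registers = {
    "SPOKEN":    {"you": 25000, "however": 100,  "said": 3000, "thus": 20},
    "FICTION":   {"you": 12000, "however": 300,  "said": 9000, "thus": 60},
    "NEWSPAPER": {"you": 3000,  "however": 700,  "said": 6000, "thus": 80},
    "ACADEMIC":  {"you": 800,   "however": 1500, "said": 400,  "thus": 900},
}
web = {"you": 4000, "however": 900, "said": 2500, "thus": 300}

best, scores = closest_register(web, registers)
for name, r in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:10s} r = {r:.3f}")
print("Most similar register:", best)
```

Because the Pearson coefficient is unaffected by rescaling, raw web hit counts and per-million corpus figures can be compared directly, as long as the same 50-60 words are used for every register.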

On the other hand, does it even make sense to try to relate the overall genre orientation of the web to one of these four or five discrete genres? Would it be better to simply refer to it as a mix of GENRE1 + GENRE2? Going even further, does it make sense to even try to relate the web to pre-defined genres, rather than perhaps just referring to it as its own "Web" register?

Thanks in advance,

Mark Davies

Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
