FW: [Corpora-List] The genre of the Web

Mark Davies Mark_Davies at byu.edu
Sun Sep 18 23:01:02 CEST 2005

As I mentioned in my original post, we all know that there is a bit of every register on the Web -- SPOKEN (transcripts of interviews, etc.), FICTION (repositories of literature), lots of NEWSPAPERS, ACADEMIC-oriented materials, and so on. So there is no question about that, of course -- the Web has a bit of everything.

The original question, though, was which genres/registers (of the BNC, for example) would have frequency data that would correspond *most closely* to reliable frequency data from the Web -- i.e., for the Web *as a whole*?

In some very, very preliminary work that I've done, it appears that the frequency data from the Web is *most* in line with the frequency data from either the newspaper or academic registers of the BNC, rather than the spoken or fiction registers. Again, this is not to say that there isn't a bit of everything, but the Web is *most similar* to the registers just mentioned.

Part of the reason that I asked the question in the first place has to do with pedagogical concerns. Suppose that my students obtain frequency data from the Web as well as frequency data from a spoken corpus. My guess is that they will find a fair number of features (lexical, grammatical, etc.) that are relatively more common in the spoken corpus than on the Web, and vice versa. My guess, though (based on very preliminary data), is that there would be less of a mismatch with newspaper or academic-based corpora.
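
For anyone who wants to try this kind of comparison, here is a minimal sketch of one possible way to measure which register "corresponds most closely" to the Web: rank the same set of words by frequency in each source and compute a Spearman rank correlation. This is only an illustration, not the procedure behind the preliminary work mentioned above. The function names (ranks, spearman, closest_register) are hypothetical, and the counts in the example are invented placeholders, not real BNC or Web figures.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


def closest_register(web, registers):
    """Return (best register name, {register name: correlation with web})."""
    words = sorted(web)  # compare the same words, in the same order, everywhere
    web_freqs = [web[w] for w in words]
    scores = {
        name: spearman(web_freqs, [freqs.get(w, 0.0) for w in words])
        for name, freqs in registers.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented per-million frequencies, for illustration only.
    web = {"the": 60000, "said": 1500, "however": 900, "yeah": 50}
    registers = {
        "spoken": {"the": 40000, "said": 800, "however": 100, "yeah": 8000},
        "newspaper": {"the": 62000, "said": 3000, "however": 600, "yeah": 20},
    }
    best, scores = closest_register(web, registers)
    print(best, scores)
```

With real data one would want frequencies normalized per million words and a much larger word list. A rank correlation is only one of several reasonable similarity measures, so results from a sketch like this should be read with the same caution as the preliminary figures above.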

