[Corpora-List] Divisions of corpus files in the BNC World

Stefan Th. Gries stgries_lists at arcor.de
Sun Jun 25 07:59:00 CEST 2006


Hi all

I have a question about the BNC World files in the genre "W_essay_school" (using David Lee's label). The files evidently contain essays by several different students, adults or teens depending on the file. However, I have not been able to find a straightforward way to determine

- the number of different students whose essays went into each file (which I had hoped to find in the header);
- the exact locations where the essays of the different students that were lumped together in any one file begin and end.

My heuristic so far has been to rely on <head> and </head>, since these tags should usually mark the heading of a new essay, but

(i) that's just been my heuristic, and I am wondering whether there's a more principled way;
(ii) that of course does not guarantee that the new essay is by a different student.
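
To make the heuristic concrete, here is a minimal Python sketch of what I mean. It assumes the raw text of one corpus file has already been read into a string; the function name split_at_heads and the regular expression are only illustrative, not anything supplied with the BNC. It cuts the text at every opening <head> tag, so each chunk is at best a candidate essay, and it says nothing about who wrote it.

```python
import re

# Matches an opening <head> tag, with or without attributes
# (e.g. <head type=MAIN>). The \b keeps <header> from matching.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def split_at_heads(text):
    """Split text into chunks, each beginning at an opening <head> tag.

    Any material before the first <head> (such as the file header)
    is returned as the first chunk. This is a heuristic only: a new
    heading need not mean a new essay, let alone a new student.
    """
    starts = [m.start() for m in HEAD_OPEN.finditer(text)]
    if not starts:
        return [text]
    # Add position 0 only if something precedes the first <head>.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Made-up sample, not actual BNC content.
    sample = "<header>meta</header><head>A</head><p>one</p><head>B</head><p>two</p>"
    for chunk in split_at_heads(sample):
        print(repr(chunk))
```

Counting the chunks that begin with <head> gives an upper bound on the number of essays under this assumption, which is exactly why I would prefer something more principled.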

I apologize if this is a stupid question to which I should know the answer myself, but I have not been able to get my head around it. Any pointers, either via the list or to me directly, would be greatly appreciated. Thanks a lot for any help you might be able to offer; I'll post a summary of the responses.
Best,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------






