[Corpora-List] "Multi-encoded" corpora

Albretch Mueller lbrtchx at gmail.com
Mon Oct 6 03:49:55 CEST 2008


I was browsing the BAWE corpus info previously posted here, and when I noticed that all texts are in PDF format (!), it made me wonder how you treat multi-encoded text, say scientific texts containing mathematical formulas, programming books containing actual code, ... ~

I think communicative universes are mostly, if not always, multi-encoded (I just don't know what to call that, but "multi-lingual" it is not), and all these code-planes participate while communicating. When you go to eat some place, you: ~

1) read a menu

2) of food made after some recipe

3) talk to the wait[er|ress]

4) pay . . . ~

Or, which is what I have in mind, say you want to encode Euclid's Elements, including all definitions, postulates (axioms), propositions (theorems and constructions), mathematical proofs of the propositions, charts, apocrypha sections, ... and then do the same with the articles that tried to prove the 5th postulate, the still ongoing philosophical/logical inquiries, ... even including Schopenhauer's beef with the obsession we mathematicians have had with this issue for more than 20 centuries ;-) ~

http://en.wikipedia.org/wiki/Schopenhauer%27s_criticism_of_the_proofs_of_the_Parallel_Postulate ~

When I say multi-encoded here, I mean "code" in a general sense. For example, there is a difference and an interplay between what is written as law and what is talked about in court. These, to me, are two different "codes" ... even though the same natural language (NL) is being used ~
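
One way to picture this is as stand-off annotation: the raw text stays untouched, and each span is labelled with the "code" it belongs to. What follows is only a minimal sketch in Python; the `Segment` class, the `segments_by_code` helper and the labels "prose" and "formula" are hypothetical names made up for illustration, not part of BAWE or of any existing corpus format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int  # offset of the first character in the raw text
    end: int    # offset one past the last character
    code: str   # hypothetical label for the "code" used in this span


def segments_by_code(text, segments, code):
    """Return the substrings of text whose segment carries the given label."""
    return [text[s.start:s.end] for s in segments if s.code == code]


# A toy "multi-encoded" sentence: natural language with one formula in it.
text = "Postulate 5 is stated in prose; a^2 + b^2 = c^2 is a formula."
formula = "a^2 + b^2 = c^2"
i = text.index(formula)

# Offsets are computed rather than hand-counted, so they stay consistent.
segments = [
    Segment(0, i, "prose"),
    Segment(i, i + len(formula), "formula"),
    Segment(i + len(formula), len(text), "prose"),
]

print(segments_by_code(text, segments, "formula"))
```

The same offsets-plus-label scheme would, in principle, also cover the law-versus-courtroom case, where both spans are in the same NL and only the label differs.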

By the way I am looking at these issues more from a semiotic point of view than a linguistic one ~

... and going back to the BAWE corpus, I know there are ways to get PDF content (essentially a picture) as text in the preprocessing format they use (was it lex?). What I don't know is how good this textual preprocessing format is at describing drawings ~

By the way, I know you can use pdf2txt, but you will be losing everything that is not plain text ~


