I did go to ach.org and allc.org, and bookmarked them to check them out, but the searches I have done on specific topics have turned up very little on the sort of things I mentioned.
It may be caused by a comprehension artifact due to my coming from an
exact science background, but what can you do by, say, knowing that
"e" and "the" are the most used letter and word in English?
> > Are there any text corpora out there including phonemes also?
> Not sure what you mean here. Are you referring to transcriptions of
> speech, which might include more or less free variation at the phonemic
> level (the two pronunciations of 'roof' and 'route'), dialectal variation
> at the phonemic level (such as whether 'pin' and 'pen' are homophones), or
> phonemes which cannot be inferred from a pronunciation dictionary (e.g.
> the present and past tense pronunciations of 'read')?
I actually mean all of these cases. If you ask a corpus "give me all words pronounced exactly like 'right'", it should give you:
"right" (adj.), "Wright" (English last name, as in the Wright Brothers), "rite" (noun), "write" (verb)
along with the texts, and the offsets within those texts, where they appear.
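To make that kind of query concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny `PRON` pronunciation table (with simplified, made-up phoneme strings), the two sample texts, and the function names `build_index` and `homophones`. A real corpus would need per-token phonemic annotation to handle cases like the two pronunciations of 'read', which a plain dictionary lookup like this one cannot tell apart.

```python
import re
from collections import defaultdict

# Hypothetical toy pronunciation table; the phoneme strings are
# simplified placeholders, not taken from any real dictionary.
PRON = {
    "right": "R AY T",
    "wright": "R AY T",
    "rite": "R AY T",
    "write": "R AY T",
    "wing": "W IH NG",
}

def build_index(texts):
    """Map each pronunciation to a list of (word, text_name, char_offset)."""
    index = defaultdict(list)
    for name, body in texts.items():
        for m in re.finditer(r"[A-Za-z']+", body):
            pron = PRON.get(m.group().lower())
            if pron is not None:
                index[pron].append((m.group(), name, m.start()))
    return index

def homophones(index, word):
    """Return every indexed occurrence pronounced like `word`."""
    pron = PRON.get(word.lower())
    return index.get(pron, []) if pron else []

# Made-up sample texts, only to exercise the lookup.
texts = {
    "a.txt": "The Wright brothers write about the right rite.",
    "b.txt": "Turn right at the wing.",
}
index = build_index(texts)
hits = homophones(index, "right")
for word, name, offset in hits:
    print(word, name, offset)
```

Indexing by pronunciation rather than by spelling is what lets a single lookup return "Wright", "write", "right" and "rite" together, each with its text and character offset.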
Or, e.g., you could study all the instances of the word "wing" in a text corpus, and their contextual usage patterns, to come to the conclusion that a phrase like:
"right wing, left wing, chicken wing, ... I am political!"
could be meant as a pun.
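The "contextual usage patterns" idea can be sketched as a small keyword-in-context (concordance) routine. This is only an illustration under stated assumptions: the function names `concordance` and `left_neighbors` are hypothetical, the tokenizer is a crude regular expression, and counting neighbors does not by itself detect a pun; it only surfaces the competing contexts a human (or a later analysis step) would look at.

```python
import re
from collections import Counter

def concordance(text, keyword, width=2):
    """Return (left_context, right_context) token lists for each hit."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, right))
    return rows

def left_neighbors(text, keyword):
    """Count the word immediately preceding each occurrence of keyword."""
    return Counter(
        left[-1] for left, _ in concordance(text, keyword, width=1) if left
    )

sample = "right wing, left wing, chicken wing, ... I am political!"
neighbors = left_neighbors(sample, "wing")
print(neighbors)
```

On a large corpus, the same counts would show "right" and "left" clustering with political vocabulary and "chicken" with food vocabulary, which is the kind of evidence the pun reading relies on.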
I am not a linguist myself, though I count semiotics and linguistics among my true loves and have done quite a bit of reading and coding on these subjects. IMHO, linguistics hasn't gone far since the times when Aristotle, as a way to somewhat measurably explain poetry, said in his "Poetika" that "the spring of life ..." (referring to youth).
Now that I mention this, I have another question. Have linguists or literary scholars written up a wish list of the features they expect from a corpus?