I'm looking for opinions on the terms used in typologies of corpora.
Some bodies of strings in a "text" have no annotated structure. They may have language-informed structure (such as sentence or clause patterns), but they are in essence a glob of text. Is this a corpus? Or must a corpus also have some annotated informatic structure, such as a corpus of newspaper articles where the beginning and end of each article are annotated? Some researchers have used the terminology 'a corpus of texts', indicating that each component part of a corpus is an independent body of words known as "a text".
If I have 15 bilingual lists (a highly structured format of 'text') in the format Language A - B, where Language A is the same across the lists but Language B is different in each list, and I am able to cite each list independently or the compilation as a whole, how would I terminologically refer to the part-whole relationship? Can a list be a 'text of a corpus'?
Is each list a corpus, or is the whole collection a corpus? Or can the term 'corpus' be expected to apply to both part and whole?
Any citable examples of corpora whose component parts are lists, especially bilingual wordlists, would be appreciated.
If you want to reply off list, in 5-6 days I'll post an anonymized summary.
Extra: In general, within the corpus-using sciences, how much parsing and annotation is required before an annotated corpus warrants the term 'database'?
all the best, - Hugh Paterson III