[Corpora-List] Identifying relevant key-n-grams (in analogy to keywords)

Steve Jeaco steve.jeaco at xjtlu.edu.cn
Mon Nov 22 02:47:04 CET 2021

Dear Robert,

My own approach for The Prime Machine is only implemented for what I call ready-made corpora (pre-processed collections held on its server; not do-it-yourself corpora built by end users). I have key collocations as a by-product of the Key Labels processing. (Key Labels are metadata labels from the corpus which are key for a word or collocation entered by the user; essentially asking how I could re-organize the corpus into sub-corpora in order to make the selected word or collocation a key word or collocation - see Corpora 15(2).)

Basically, in a relational database, firstly collocations of 2 to 5 words in length are calculated (IJCALLT 9(3)); secondly, frequencies inside and outside groupings of texts (or sentences) are calculated. Then log-likelihood keyword contingency tables are made, multiplying the frequency by the length of the collocation each time, and key collocations for each metadata item are generated. Finally, the end user can use the Research Tools -> Keywords -> Label Keywords tab to enter the first few letters of a label (e.g. medicine), select a label from the suggestions (e.g. Medicine) and get key collocations for a sub-section of a corpus compared to the remainder of that corpus (e.g. BNC 1994: Academic, [Medicine sub-section] vs [Humanities and Arts + Natural Science + Politics, Law and Education + Social Science + Technology and Engineering sub-sections]). Note: on smartphones you’d need to select “Full” mode before connecting, as tPM defaults to a smaller interface without Research Tools on devices with a small screen.
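As a rough illustration of the contingency-table step described above, here is a minimal Python sketch. It is not tPM's actual code: the function names, the two-cell form of the log-likelihood statistic, and the corpus sizes in the usage lines are all assumptions made for the example. Only the idea of multiplying each frequency by the collocation length comes from the description above.

```python
import math


def log_likelihood(freq_in, freq_out, size_in, size_out):
    """Two-corpus log-likelihood (G2) for one item.

    freq_in / freq_out: frequency inside / outside the grouping.
    size_in / size_out: token counts of the two sub-corpora.
    """
    total_freq = freq_in + freq_out
    total_size = size_in + size_out
    expected_in = size_in * total_freq / total_size
    expected_out = size_out * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_in, expected_in), (freq_out, expected_out)):
        if observed > 0:  # a zero cell contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def key_collocation_ll(collocation, freq_in, freq_out, size_in, size_out):
    """Assumed weighting: multiply each frequency by the collocation length
    (in words) so that it is comparable with corpus sizes counted in tokens."""
    length = len(collocation.split())
    return log_likelihood(freq_in * length, freq_out * length, size_in, size_out)


# Hypothetical sub-corpus sizes; the frequencies are the "patients with" figures.
score = key_collocation_ll("patients with", 3052, 42, 1_400_000, 14_000_000)
print(round(score, 1))
```

In this two-cell form, the weighting simply scales the score by the collocation length, so longer collocations need proportionally fewer occurrences to reach the same score.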

When trying to extract words (and collocations) which are key in the sense that they are intended to somehow correspond to what might be foregrounded/deviant/salient/marked for a reader, I find that log-likelihood with Bayes factors (Wilson, 2013) is suitable and gives useful results. I gave a rationale for continued use of log-likelihood in many situations in IJCL 25(2).
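For readers unfamiliar with the Bayes factor step, the sketch below shows the BIC-style approximation commonly associated with Wilson (2013), in which the log of the combined corpus size is subtracted from the log-likelihood score. The function names, the evidence labels and their cut-offs are stated here as I understand them from general usage and are assumptions, not details taken from tPM.

```python
import math


def bayes_factor_bic(ll, total_size, df=1):
    """Approximate Bayes factor on the BIC scale: LL - df * ln(N).

    total_size is N, the combined token count of both corpora (assumed).
    """
    return ll - df * math.log(total_size)


def evidence_label(bic):
    """Commonly cited interpretation bands (assumed, not from the source)."""
    if bic < 0:
        return "favours the null hypothesis"
    if bic < 2:
        return "not worth more than a bare mention"
    if bic < 6:
        return "positive"
    if bic < 10:
        return "strong"
    return "very strong"


# Hypothetical values: an LL score of 25.0 in a 15-million-token comparison.
bic = bayes_factor_bic(25.0, 15_000_000)
print(round(bic, 2), evidence_label(bic))
```

Because the penalty grows with corpus size, an LL score that passes a conventional significance threshold can still count as weak evidence in a very large corpus.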

The method will give results like “patients with” (3052 vs 42) and “the presence of” (537 vs 975) for the example above, as stop-lists are not used; but as these are based on collocations rather than n-grams, I think users will find most of these meaningful. Other key collocations for the example given include various multi-word disease names, groups/departments like “general practitioner” (234 vs 27) and “family health services” (132 vs 0), and other academic phrases like “there was no significant difference” (82 vs 4).

Essentially, my approach begins with something more flexible than n-grams (2-5 word combinations, with .. for a skipped item), then cuts down the combinations using a collocation measure (LL + Bayes Factors), and then cuts down these collocations to key collocations using the key word method (again with LL + Bayes Factors).
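To make the first stage concrete, the following Python sketch generates candidate 2-5 word combinations from a tokenised sentence, writing .. where one interior word is skipped. Exactly how tPM builds its candidates is not stated here, so the single-skip rule and the function name are assumptions for illustration only; the later filtering stages (collocation measure, then keyness) are not shown.

```python
def candidate_combinations(tokens, min_len=2, max_len=5):
    """Return contiguous word sequences of min_len..max_len tokens, plus
    variants with one interior token replaced by '..' (assumed skip rule)."""
    candidates = set()
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            candidates.add(" ".join(window))
            # Replace each interior position in turn; the first and last
            # words are kept so every candidate starts and ends on a word.
            for skip in range(1, n - 1):
                gapped = window[:skip] + [".."] + window[skip + 1:]
                candidates.add(" ".join(gapped))
    return candidates


tokens = "there was no significant difference".split()
for candidate in sorted(candidate_combinations(tokens)):
    print(candidate)
```

Each candidate would then need its frequency counted before a collocation measure could be applied to decide which combinations to keep.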

The reason I’ve not implemented this for do-it-yourself corpora in tPM is that (a) the user would need to wait for all collocations to be processed first; and (b) when applying the method using an external reference corpus, it takes rather longer to check combinations which qualify as collocations in the study corpus against the whole reference corpus, rather than just against the subset of pre-processed collocations. But there is already a function in tPM for do-it-yourself corpora to generate n-grams and collect frequencies of these in a ready-made corpus. I could look into adding key collocations for DIY corpora in future if there is interest. One main limitation of tPM at the moment is the range of ready-made corpora available beyond my home institution.

I can email through more details if you’re interested.

Stephen Jeaco, Suzhou, China

PS Hoping to get tPM for iPad back up on TestFlight in the next few weeks; I feel I’m forever playing catch-up with iOS system updates! So if you want to check out Key Labels, use the Windows or macOS versions of tPM.
