[Corpora-List] Identifying relevant key-n-grams (in analogy to keywords)

Robert Fuchs robert.fuchs.dd at googlemail.com
Fri Nov 19 20:02:33 CET 2021


Dear all,


We are comparing a reference corpus and a target corpus in order to identify keywords and key phrases on a particular topic that is prominent in the target corpus but not in the reference corpus. We use log ratio and statistical significance to identify candidates for keywords, i.e. 1-grams, and then go through the candidates manually to identify those that are relevant to the topic at hand (e.g. unemployment and labour relations). We remove items that are not relevant, for example those that reflect a one-off event such as a particular sports tournament held during the period covered by the target corpus.
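As a minimal sketch of the scoring step described above: the code below computes, for each word in the target corpus, the log ratio (binary log of the ratio of relative frequencies) and a log-likelihood (G2) statistic. The post does not say which significance test is used, so the two-term G2 formula, the function name `keyness` and the `smoothing` constant for words absent from the reference corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def keyness(target_tokens, reference_tokens, smoothing=0.5):
    """Return {word: (log_ratio, g2)} for every word type in the target corpus.

    `smoothing` (an assumption, not from the post) replaces a zero
    reference count so that the log ratio stays finite.
    """
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    n_tgt, n_ref = len(target_tokens), len(reference_tokens)
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        # Log ratio: how many times the relative frequency doubles
        # (or halves) from the reference to the target corpus.
        log_ratio = math.log2((a / n_tgt) / (max(b, smoothing) / n_ref))
        # Expected counts if the word were equally frequent in both corpora.
        e_tgt = n_tgt * (a + b) / (n_tgt + n_ref)
        e_ref = n_ref * (a + b) / (n_tgt + n_ref)
        # Two-term log-likelihood; a zero count contributes nothing.
        g2 = 2 * (a * math.log(a / e_tgt)
                  + (b * math.log(b / e_ref) if b else 0.0))
        scores[word] = (log_ratio, g2)
    return scores


# Toy corpora, invented purely for illustration.
target = ["unemployment"] * 8 + ["the"] * 12
reference = ["unemployment"] * 1 + ["the"] * 19
scores = keyness(target, reference)
```

Candidates would then be the words whose log ratio is positive and whose G2 exceeds a chosen cut-off (3.84 corresponds to p < 0.05 at one degree of freedom); the manual relevance check still has to follow.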

In addition, we are looking at n-grams with n greater than 1, and we're not sure how to decide which n-grams are relevant. For example, “unemployment causes poverty” is certainly relevant. On the other hand, “unemployment is” or “the unemployed are” or “unemployment causes” are not relevant.

I would be interested in hearing about any established practices for distinguishing relevant from non-relevant n-grams, or more generally any thoughts on how this can be done in a principled way rather than through ad hoc decisions.

A solution we have considered so far is to exclude n-grams that only consist of function words in addition to a single content word that we already identified as a relevant keyword/1-gram. Other than this simple solution, we were wondering if there are more advanced approaches to this problem.
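The simple filter described above could be sketched roughly as follows. The function-word list is a tiny illustrative stand-in (in practice one would use a full stop list or part-of-speech tags), and the names `FUNCTION_WORDS` and `is_redundant` are hypothetical.

```python
# Tiny illustrative list; a real stop list would be much longer.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}


def is_redundant(ngram, keywords, function_words=FUNCTION_WORDS):
    """True if the n-gram is function words plus exactly one content
    word that is already an accepted keyword (1-gram)."""
    tokens = ngram.lower().split()
    content = [t for t in tokens if t not in function_words]
    return len(content) == 1 and content[0] in keywords


keywords = {"unemployment", "unemployed"}
candidates = [
    "unemployment is",
    "the unemployed are",
    "unemployment causes",
    "unemployment causes poverty",
]
kept = [c for c in candidates if not is_redundant(c, keywords)]
# kept == ["unemployment causes", "unemployment causes poverty"]
```

Note that this keeps “unemployment causes”, because “causes” is a content word; that case is exactly where the simple filter falls short.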

Thanks and best



-- Prof. Dr. Robert Fuchs (JP) | Department of English Language and Literature/Institut für Anglistik und Amerikanistik | University of Hamburg | Überseering 35, 22297 Hamburg, Germany | Room 07076 | https://uni-hamburg.academia.edu/RobertFuchs | https://sites.google.com/site/rflinguistics/

Mailing list on varieties of English/World Englishes/ENL-ESL-EFL. Subscribe here: https://groups.google.com/forum/#!forum/var-eng/join Are you a non-native speaker of English? Please help us by taking this short survey on when and how you use the English language: https://lamapoll.de/englishusageofnonnativespeakers-1/
