[Corpora-List] Identifying relevant key-n-grams (in analogy to keywords)

Costas Gabrielatos Gabrielc at edgehill.ac.uk
Mon Nov 22 15:11:31 CET 2021


Dear Robert

The relevance of n-grams cannot be established simply by looking at the list. Manual examination of instances (with sufficient co-text) is needed. Why don't you select n-grams using the same criteria you used for selecting 1-grams?

Best regards

Costas

....................................................................
Dr Costas Gabrielatos
Reader in Corpus Linguistics & English Language
Edge Hill University
https:/ehu.ac.uk/gabrielatos<https://www.edgehill.ac.uk/english/dr-costas-gabrielatos>

From: corpora-bounces at uib.no <corpora-bounces at uib.no> On Behalf Of Rodrigo Esteves de Lima Lopes
Sent: 22 November 2021 13:45
To: Robert Fuchs <robert.fuchs.dd at googlemail.com>
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Identifying relevant key-n-grams (in analogy to keywords)


Dear Robert,

You might also have a look at this:

Bondi, Marina & Mike Scott (eds.). 2010. Keyness in Texts. Amsterdam/Philadelphia: John Benjamins Publishing Company. 251 pp. ISBN 978-90272-8766-3. <https://journals.openedition.org/asp/4932>

All the best,
Rodrigo


On Fri, 19 Nov 2021 at 16:13, Robert Fuchs <robert.fuchs.dd at googlemail.com> wrote:

Dear all,

We are comparing a reference corpus and a target corpus in order to identify keywords and key phrases on a particular topic that is prominent in the target corpus but not in the reference corpus. We use log ratio and statistical significance to identify candidates for keywords, i.e. 1-grams, and then go through these candidates manually to identify those that are relevant to the topic at hand (e.g. unemployment and labour relations). We remove items that are not relevant, for example those that are key only because of a one-off event, like a particular sports tournament, during the period of the target corpus.
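For readers who want to reproduce the candidate step, here is a minimal sketch in Python. It assumes that "statistical significance" means the log-likelihood (G2) test, which the message does not actually specify, and that log ratio is the binary log of the ratio of relative frequencies. The function names, the 0.5 smoothing for zero frequencies, and the thresholds (a log ratio of at least 1, and G2 of at least 3.84, i.e. p < 0.05 at one degree of freedom) are illustrative assumptions, not the settings used in the study.

```python
import math


def log_ratio(freq_target, size_target, freq_ref, size_ref, smoothing=0.5):
    """Binary log of the ratio of relative frequencies (target vs reference).

    A zero frequency is replaced by `smoothing` to avoid division by zero;
    the value 0.5 is an assumption, not something stated in the message.
    """
    ft = freq_target if freq_target > 0 else smoothing
    fr = freq_ref if freq_ref > 0 else smoothing
    return math.log2((ft / size_target) / (fr / size_ref))


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) for one item, using the two observed frequencies."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero observation contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keyword_candidates(target_counts, ref_counts, size_target, size_ref,
                       min_log_ratio=1.0, min_g2=3.84):
    """Return (item, log ratio, G2) tuples that pass both thresholds,
    sorted by descending log ratio. Thresholds are illustrative only."""
    candidates = []
    for item, freq_target in target_counts.items():
        freq_ref = ref_counts.get(item, 0)
        lr = log_ratio(freq_target, size_target, freq_ref, size_ref)
        g2 = log_likelihood(freq_target, size_target, freq_ref, size_ref)
        if lr >= min_log_ratio and g2 >= min_g2:
            candidates.append((item, lr, g2))
    return sorted(candidates, key=lambda row: row[1], reverse=True)
```

The same functions work unchanged for n-grams if the count dictionaries are keyed by n-gram strings and the corpus sizes are the n-gram totals. The output is only a candidate list; the manual relevance check described above still has to follow.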

In addition, we are looking at n-grams with n greater than 1, and we're not sure how to decide which n-grams are relevant. For example, “unemployment causes poverty” is certainly relevant. On the other hand, “unemployment is” or “the unemployed are” or “unemployment causes” are not relevant.

I would be interested in hearing about any established practices about how to distinguish relevant from non-relevant n-grams, or more generally any thoughts on how this can be done in a principled way other than making ad hoc decisions.

A solution we have considered so far is to exclude n-grams that only consist of function words in addition to a single content word that we already identified as a relevant keyword/1-gram. Other than this simple solution, we were wondering if there are more advanced approaches to this problem.
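The exclusion rule just described could be sketched roughly as follows in Python. The function-word list is a tiny hand-picked stand-in (in practice one would use a fuller stop list or a part-of-speech tagger), and the names `FUNCTION_WORDS` and `is_redundant_ngram` are hypothetical.

```python
# Illustrative, deliberately incomplete function-word list (an assumption).
FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "was", "were", "be", "been",
    "of", "in", "on", "to", "for", "with", "by", "as",
    "and", "or", "that", "this", "it",
}


def is_redundant_ngram(ngram, keywords, function_words=FUNCTION_WORDS):
    """True if the n-gram is function words plus exactly one content word
    that is already an accepted keyword (1-gram)."""
    tokens = ngram.lower().split()
    content = [tok for tok in tokens if tok not in function_words]
    return len(content) == 1 and content[0] in keywords


keywords = {"unemployment", "unemployed", "poverty"}
ngrams = [
    "unemployment is",
    "the unemployed are",
    "unemployment causes",
    "unemployment causes poverty",
]
kept = [ng for ng in ngrams if not is_redundant_ngram(ng, keywords)]
```

With these inputs the rule drops "unemployment is" and "the unemployed are" but keeps "unemployment causes", because "causes" is a content word. That one of the n-grams judged irrelevant above survives the filter shows why the simple rule alone is not enough.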

Thanks and best

Robert

-- Prof. Dr. Robert Fuchs (JP) | Department of English Language and Literature/Institut für Anglistik und Amerikanistik | University of Hamburg | Überseering 35, 22297 Hamburg, Germany | Room 07076 | https://uni-hamburg.academia.edu/RobertFuchs | https://sites.google.com/site/rflinguistics/
