[Corpora-List] Identifying relevant key-n-grams (in analogy to keywords)

George Giannakopoulos ggianna at iit.demokritos.gr
Fri Nov 19 20:34:48 CET 2021


Dear Robert,

You can find a possible (language- and domain-independent) approach in the following paper:

Giannakopoulos, G., Karkaletsis, V., Vouros, G., & Stamatopoulos, P. (2008). Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3), 1-39. <https://dl.acm.org/doi/abs/10.1145/1410358.1410359?casa_token=fHcEx60RnqIAAAAA:5EjIOfQlhClYFC8e6AUNTryWK0YBYiBO6ySwi7KcBuqJnXK2ytB5uBsbSbk86dBK3oo72g52WA>

See Section 4.1 (symbols vs. non-symbols). The approach, which is probabilistically supported, is also applicable to character n-grams.

Kind regards,
George G.

Robert Fuchs wrote:
>
> Dear all,
>
> We are comparing a reference corpus and a target corpus in order to identify
> keywords and key phrases on a particular topic that is prominent in the target
> corpus but not in the reference corpus. We use log ratio and statistical
> significance to identify candidate keywords, i.e. 1-grams, and then go through
> the candidates manually in order to identify those that are relevant to the
> topic at hand (e.g. unemployment and labour relations). We remove items that
> are not relevant, for example those reflecting a one-off event, such as a
> particular sports tournament that took place during the period of the target
> corpus.
>
>
> In addition, we are looking at n-grams with n greater than 1, and we're not
> sure how to decide which of these are relevant. For example, “unemployment
> causes poverty” is certainly relevant. On the other hand, “unemployment is”,
> “the unemployed are”, and “unemployment causes” are not relevant.
>
>
> I would be interested in hearing about any established practices for
> distinguishing relevant from non-relevant n-grams, or more generally any
> thoughts on how this can be done in a principled way rather than through
> ad hoc decisions.
>
>
> A solution we have considered so far is to exclude n-grams that consist only
> of function words plus a single content word that we have already identified
> as a relevant keyword/1-gram. Beyond this simple heuristic, we were wondering
> whether there are more advanced approaches to this problem.
>
>
> Thanks and best 
>
> Robert
>
> --
> Prof. Dr. Robert Fuchs (JP) | Department of English Language and
> Literature/Institut für Anglistik und Amerikanistik | University of Hamburg |
> Überseering 35, 22297 Hamburg, Germany | Room 07076 |
> https://uni-hamburg.academia.edu/RobertFuchs |
> https://sites.google.com/site/rflinguistics/
>
>
> Mailing list on varieties of English/World Englishes/ENL-ESL-EFL. Subscribe
> here: https://groups.google.com/forum/#!forum/var-eng/join
> Are you a non-native speaker of English? Please help us by taking this short
> survey on when and how you use the English language:
> https://lamapoll.de/englishusageofnonnativespeakers-1/
>
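[Editor's note: the log-ratio and log-likelihood keyness step described in the question can be sketched as below. This is a minimal illustration, not the poster's actual code; the function names, smoothing choice (0.5 for zero counts), and example frequencies are assumptions.]

```python
import math

def log_ratio(freq_t, size_t, freq_r, size_r):
    """Binary log of the ratio of relative frequencies (Hardie's Log Ratio),
    substituting 0.5 for a zero count to avoid division by zero."""
    rel_t = (freq_t or 0.5) / size_t
    rel_r = (freq_r or 0.5) / size_r
    return math.log2(rel_t / rel_r)

def log_likelihood(freq_t, size_t, freq_r, size_r):
    """Dunning's log-likelihood (G2) keyness statistic; with one degree of
    freedom, values above ~3.84 are significant at p < 0.05."""
    total = size_t + size_r
    expected_t = size_t * (freq_t + freq_r) / total
    expected_r = size_r * (freq_t + freq_r) / total
    g2 = 0.0
    for observed, expected in ((freq_t, expected_t), (freq_r, expected_r)):
        if observed:  # a zero count contributes nothing to G2
            g2 += 2.0 * observed * math.log(observed / expected)
    return g2

# Hypothetical counts: 120 hits in a 1M-word target vs. 40 in a 2M-word reference.
print(round(log_ratio(120, 1_000_000, 40, 2_000_000), 2))       # 2.58
print(round(log_likelihood(120, 1_000_000, 40, 2_000_000), 2))  # 116.16
```

Candidates would typically be kept when G2 exceeds the significance threshold and then ranked by log ratio, before the manual relevance pass the question describes.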
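[Editor's note: the function-word filtering heuristic proposed in the question can be sketched as below. The stop list is an illustrative fragment, not a real resource; full function-word lists for the language would be needed in practice.]

```python
# Illustrative (incomplete) English function-word list.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were",
                  "and", "or", "of", "to", "in", "on", "for"}

def is_redundant(ngram, relevant_keywords):
    """True if the n-gram consists only of function words plus a single
    content word that is already a known relevant keyword/1-gram."""
    content = [tok for tok in ngram if tok.lower() not in FUNCTION_WORDS]
    return len(content) == 1 and content[0].lower() in relevant_keywords

keywords = {"unemployment", "unemployed"}
print(is_redundant(("unemployment", "is"), keywords))                 # True  -> exclude
print(is_redundant(("the", "unemployed", "are"), keywords))           # True  -> exclude
print(is_redundant(("unemployment", "causes", "poverty"), keywords))  # False -> keep
```

Note that an n-gram like “unemployment causes”, which the question flags as non-relevant, contains two content words and so passes this filter; such cases would still fall to the manual review step.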

--
George Giannakopoulos, PhD

Researcher, SKEL Lab - NCSR Demokritos <http://www.iit.demokritos.gr>
Home page: <http://www.iit.demokritos.gr/~ggianna>

and

Co-founder, Chief Executive Officer, SciFY Not-for-Profit Company <http://www.scify.org>



