A possible (language- and domain-independent) approach is described in the following paper:
Giannakopoulos, G., Karkaletsis, V., Vouros, G., & Stamatopoulos, P. (2008). Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3), 1-39. <https://dl.acm.org/doi/abs/10.1145/1410358.1410359?casa_token=fHcEx60RnqIAAAAA:5EjIOfQlhClYFC8e6AUNTryWK0YBYiBO6ySwi7KcBuqJnXK2ytB5uBsbSbk86dBK3oo72g52WA>
See Section 4.1 (Symbols, non-symbols). The approach has a probabilistic justification and is also applicable to character n-grams.
Kind regards, George G.
Robert Fuchs wrote:
> *Dear all,*
> We are comparing a reference corpus and a target corpus in order to identify
> keywords and key phrases on a particular topic that is prominent in the target
> corpus but not in the reference corpus. We use log ratio and statistical
> significance in order to identify candidates for keywords, i.e. 1-grams, and
> then go through the rest manually in order to identify those that are relevant
> to the topic at hand (e.g. unemployment and labour relations). We remove items
> that are not relevant, for example if there was a random event like a
> particular sports tournament during the period of the target corpus.
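
For readers unfamiliar with the keyword step described above, here is a minimal sketch of how log ratio and a significance score might be combined. It is not the poster's actual pipeline. The function names, the 0.5 smoothing constant for zero frequencies, the thresholds, and the use of the common two-term log-likelihood (G2) statistic as the significance test are all assumptions made for illustration.

```python
import math
from collections import Counter

def log_ratio(freq_target, size_target, freq_ref, size_ref, smoothing=0.5):
    """Binary log of the ratio of relative frequencies (target vs. reference)."""
    # Assumption: a zero frequency is replaced by a small constant
    # so that the ratio stays finite.
    ft = freq_target if freq_target > 0 else smoothing
    fr = freq_ref if freq_ref > 0 else smoothing
    return math.log2((ft / size_target) / (fr / size_ref))

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term G2 statistic for a frequency comparison between two corpora."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def keyword_candidates(target_tokens, ref_tokens, min_log_ratio=1.0, min_g2=3.84):
    """Return (word, log_ratio, g2) for words overrepresented in the target.

    The defaults are illustrative: 3.84 is the chi-square critical value
    for p < 0.05 with one degree of freedom.
    """
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    candidates = []
    for word, freq in target.items():
        lr = log_ratio(freq, n_target, ref[word], n_ref)
        g2 = log_likelihood(freq, n_target, ref[word], n_ref)
        if lr >= min_log_ratio and g2 >= min_g2:
            candidates.append((word, lr, g2))
    # Strongest effect size first; manual review starts from the top.
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

The output of `keyword_candidates` is only a candidate list. The manual relevance check described in the message (e.g. dropping words tied to a one-off sports event) still has to happen afterwards.
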
> In addition, we are looking at n-grams with n greater than 1, and we're not sure
> how to decide which n-grams are relevant. For example, “unemployment causes
> poverty” is certainly relevant. On the other hand, “unemployment is” or “the
> unemployed are” or “unemployment causes” are not relevant.
> I would be interested in hearing about any established practices about how to
> distinguish relevant from non-relevant n-grams, or more generally any thoughts
> on how this can be done in a principled way other than making ad hoc decisions.
> A solution we have considered so far is to exclude n-grams that only consist
> of function words in addition to a single content word that we already
> identified as a relevant keyword/1-gram. Other than this simple solution, we
> were wondering if there are more advanced approaches to this problem.
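
The simple solution mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function-word list is a tiny hypothetical sample (a real one would come from a stop-word list or a POS tagger), and `is_redundant` is a made-up name, not an established tool.

```python
# Hypothetical, deliberately small function-word list for illustration only.
FUNCTION_WORDS = frozenset({
    "the", "a", "an", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "for", "with", "that", "this", "it",
})

def is_redundant(ngram, keywords, function_words=FUNCTION_WORDS):
    """True if the n-gram is one known keyword padded only with function words.

    ngram: sequence of tokens, e.g. ("unemployment", "is")
    keywords: set of 1-grams already accepted as relevant keywords
    """
    tokens = [token.lower() for token in ngram]
    content = [token for token in tokens if token not in function_words]
    return len(content) == 1 and content[0] in keywords

def filter_ngrams(ngrams, keywords, function_words=FUNCTION_WORDS):
    """Keep only n-grams that add something beyond a single known keyword."""
    return [ng for ng in ngrams if not is_redundant(ng, keywords, function_words)]
```

Applied to the examples in the message, with "unemployment" and "unemployed" as known keywords, this drops "unemployment is" and "the unemployed are" and keeps "unemployment causes poverty". It also keeps "unemployment causes", because "causes" is a content word, which shows where the rule stops being sufficient. N-grams made up entirely of function words are not touched by this rule either and would need a separate check.
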
> Thanks and best
> Prof. Dr. Robert Fuchs (JP) | Department of English Language and
> Literature/Institut für Anglistik und Amerikanistik | University of Hamburg |
> Überseering 35, 22297 Hamburg, Germany | Room 07076 |
> https://uni-hamburg.academia.edu/RobertFuchs |
> Mailing list on varieties of English/World Englishes/ENL-ESL-EFL. Subscribe
> here: https://groups.google.com/forum/#!forum/var-eng/join
> Are you a non-native speaker of English? Please help us by taking this short
> survey on when and how you use the English language:
*George Giannakopoulos, PhD*
/Researcher/ Home page <http://www.iit.demokritos.gr/~ggianna> SKEL Lab - NCSR Demokritos <http://www.iit.demokritos.gr>
/Co-founder, Chief Executive Officer/ SciFY Not-for-Profit Company <http://www.scify.org>