Your contributions give me direction on how to evaluate my proposal. I think a human evaluation is the most appropriate way to evaluate it. I already have sentences constructed by humans, but since I used them as the basis for automatically producing more, using them in the evaluation could introduce bias. Based on your paper, I think the next challenging step is to define the evaluation criteria.
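As a starting point, I imagine collecting 1-5 ratings for fluency and adequacy from at least two annotators and checking their agreement. A minimal Python sketch of how I might aggregate such ratings (the scores are made up, and the use of Cohen's kappa as the agreement measure is just one simple option):

```python
# Hypothetical example: aggregating 1-5 Likert ratings for one dimension
# (e.g. fluency) from two annotators, plus Cohen's kappa as an agreement check.
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Illustrative placeholder ratings, not real annotation data.
fluency_a = [5, 4, 4, 3, 5]
fluency_b = [5, 4, 3, 3, 5]

print(mean(fluency_a), mean(fluency_b))
print(cohens_kappa(fluency_a, fluency_b))
```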
On Wed, Apr 13, 2022 at 10:08, David Howcroft < dave.howcroft at gmail.com> wrote:
> Hi Jayr,
> Unfortunately there are not good automated metrics for evaluating natural
> language generation (NLG) in general, though there may be some tools you
> can use to assess certain aspects of texts in certain contexts.
> The 'gold standard' is to do some kind of human evaluation where you ask
> your participants to assess the quality of the text along whatever
> dimensions are most important for your task, usually including at least
> some assessment of 'fluency' (e.g. grammaticality, comprehensibility,
> clarity, naturalness, etc) and some assessment of adequacy (e.g.
> semantic/content accuracy w.r.t. the input, truthfulness, coherence, etc).
> Shameless self-promotion: I led an effort a couple of years ago to look at
> how NLG evaluation has been done by the research community over the last 20
> years (https://aclanthology.org/2020.inlg-1.23/), which might prove
> useful. Novikova et al. 2017 looked at automated metrics in a systematic way.
> You can find pointers to work on human evaluations through the HumEval
> workshops (https://humeval.github.io/) and the ReproGen shared tasks.
> Happy to talk more offline if you would like :)
> David M. Howcroft
> On Wed, Apr 13, 2022 at 1:06 PM Jayr Alencar Pereira <jap2 at cin.ufpe.br> wrote:
>> Hi, everybody.
>> I collected a list of 700 example sentences from domain specialists and
>> used this list as a basis for generating 9,000 new sentences with a
>> generative language model. Now I am looking for methods for evaluating the
>> quality of the generated corpus.
>> I have trained an n-gram language model on the generated corpus and
>> measured its perplexity on the specialists' sentences. The results are
>> good, but I think I should also evaluate it using other methods.
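The perplexity measurement described above can be sketched as follows. This is a self-contained toy version (a bigram model with add-one smoothing over placeholder sentences, not the actual 9,000-sentence corpus or the specialists' data):

```python
# Minimal sketch: train a bigram LM on generated sentences, then measure
# perplexity on held-out (human-written) sentences.
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentences, unigrams, bigrams):
    vocab_size = len(unigrams)
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            # Add-one (Laplace) smoothing so unseen bigrams get nonzero probability.
            p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Illustrative placeholder data.
generated = ["the model generates text", "the model writes sentences"]
held_out = ["the model generates sentences"]

uni, bi = train_bigram(generated)
print(perplexity(held_out, uni, bi))
```

Lower perplexity on the held-out human sentences means the generated corpus induces a model that predicts human text better.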
>> If you have any related research, please let me know.
>> Thank you in advance.
>> Pax et bonum
>> Jayr Alencar Pereira
>> PhD student
>> Center of Informatics, Federal University of Pernambuco, Recife - Brazil
>> Homepage: jayr.clubedosgeeks.com.br
>> GitHub: @jayralencar <https://github.com/jayralencar>
>> CV Lattes
--
Pax et bonum
Jayr Alencar Pereira
PhD student
Center of Informatics, Federal University of Pernambuco, Recife - Brazil
Homepage: jayr.clubedosgeeks.com.br
GitHub: @jayralencar <https://github.com/jayralencar>
CV Lattes <http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K8561724U9>