[Corpora-List] How to evaluate a synthetic text corpus?

Jayr Alencar Pereira jap2 at cin.ufpe.br
Wed Apr 13 13:54:06 CEST 2022


Hi, everybody.

I collected a list of 700 example sentences from domain specialists and used it as the basis for generating 9k new sentences with a generative language model. Now I am looking for methods to evaluate the quality of the generated corpus.

I trained an n-gram language model on the generated corpus and measured its perplexity on the specialists' sentences. The results are good, but I think the corpus could also be evaluated with other methods.
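To make the setup concrete, here is a minimal sketch of that kind of check. It is not my actual code: it assumes a bigram model with add-one smoothing and whitespace tokenization, and the names train_bigram and perplexity are only illustrative.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase + whitespace tokenization, with sentence boundary markers.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram(sentences):
    """Count bigrams and their left contexts over the (generated) training corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<unk>"}
    for sentence in sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab

def perplexity(model, sentences):
    """Perplexity of the model on held-out sentences (here: the specialists' sentences)."""
    contexts, bigrams, vocab = model
    vocab_size = len(vocab)
    log_prob, n_predictions = 0.0, 0
    for sentence in sentences:
        # Words never seen in training are mapped to <unk>.
        tokens = [t if t in vocab else "<unk>" for t in tokenize(sentence)]
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one (Laplace) smoothing so unseen bigrams get non-zero probability.
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
            log_prob += math.log(p)
            n_predictions += 1
    return math.exp(-log_prob / n_predictions)
```

Usage would be along the lines of model = train_bigram(generated_sentences) followed by perplexity(model, specialist_sentences), where both variables are plain lists of strings. Lower perplexity means the generated corpus predicts the specialists' sentences better, but it says nothing about diversity or about sentences copied verbatim from the seed list.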

If you have any related research, please let me know.

Thank you in advance.

--
Pax et bonum

Jayr Alencar Pereira
PhD student
Center of Informatics, Federal University of Pernambuco, Recife - Brazil
Homepage: jayr.clubedosgeeks.com.br
GitHub: @jayralencar <https://github.com/jayralencar>
CV Lattes <http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K8561724U9>


