Unfortunately, there are no good automated metrics for evaluating natural language generation (NLG) in general, though there may be some tools you can use to assess certain aspects of texts in certain contexts.

The 'gold standard' is to do some kind of human evaluation where you ask your participants to assess the quality of the text along whatever dimensions are most important for your task, usually including at least some assessment of 'fluency' (e.g. grammaticality, comprehensibility, clarity, naturalness, etc.) and some assessment of 'adequacy' (e.g. semantic/content accuracy w.r.t. the input, truthfulness, coherence, etc.).

Shameless self-promotion: I led an effort a couple of years ago to look at how NLG evaluation has been done by the research community over the last 20 years (https://aclanthology.org/2020.inlg-1.23/), which might prove helpful.

Novikova et al. 2017 looked at automated metrics in a systematic way: http://aclweb.org/anthology/D17-1238

You can find pointers to work on human evaluations through the HumEval workshops (https://humeval.github.io/) and the ReproGen shared tasks (https://reprogen.github.io/).

> Hi, everybody.
> I collected a list of 700 example sentences from domain specialists. And
> used this list as a basis for generating new 9 k sentences using a
> generative language model. Now, I am looking for methods for evaluating the
> quality of my generated corpus.
> I have trained an n-gram language model using the generated corpus and
> measured the model perplexity in the specialists' sentences. I have good
> results on it, but I think I can evaluate it using other methods.
> If you have any related research, please let me know.
> Thank you in advance.
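
For reference, the perplexity check described in the quoted message could be sketched roughly as below. This is only a minimal illustration under my own assumptions: a bigram model with add-one smoothing, naive whitespace tokenization, and toy sentences standing in for the real data. The names (tokenize, train_bigram, perplexity) are hypothetical, and the original poster's actual n-gram order, smoothing and toolkit are not stated.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase + whitespace split, with sentence boundary markers.
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count bigrams and their contexts over the training (generated) corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<unk>"}
    for s in sentences:
        toks = tokenize(s)
        vocab.update(toks)
        contexts.update(toks[:-1])           # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))  # adjacent token pairs
    return contexts, bigrams, vocab


def perplexity(model, sentences):
    """Perplexity of held-out sentences under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) - 1  # "<s>" is never predicted, so exclude it from the count
    log_prob, n = 0.0, 0
    for s in sentences:
        # Map words unseen in training to the "<unk>" placeholder.
        toks = [t if t in vocab else "<unk>" for t in tokenize(s)]
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


# Toy stand-ins (hypothetical): "generated" trains the model, the others are scored.
generated = ["the cat sat", "the dog sat"]
model = train_bigram(generated)
ppl_similar = perplexity(model, ["the cat sat"])
ppl_different = perplexity(model, ["purple elephants fly"])
print(ppl_similar, ppl_different)
```

Lower perplexity on the specialists' sentences suggests the generated corpus covers similar word sequences, but it says little on its own about the fluency or adequacy of individual generated sentences.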
