Clearly the recall and precision on the second script are higher. Of course, without knowing the total number of words that should be generated, it's hard to say more. In particular, it's hard to say whether 471 is good. (Is the second script getting 471 out of 500 possible, or 471 out of 50,000?)
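To make the point about the denominator concrete, here is a small sketch of the arithmetic. It assumes that 471 is the count of correct words produced; the two totals are just the hypothetical ones from the parenthetical above, not real figures, and `recall` is a helper name made up for this example.

```python
def recall(correct: int, expected_total: int) -> float:
    """Fraction of the words that should be generated that actually were."""
    if expected_total <= 0:
        raise ValueError("expected_total must be positive")
    return correct / expected_total


# The same 471 correct words under the two hypothetical totals.
for total in (500, 50_000):
    print(f"471 of {total}: recall = {recall(471, total):.2%}")
```

The count is the same in both cases, but the recall differs by a factor of 100, which is why the raw number alone doesn't tell you much.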
In general, though, I think comparing aggregate counts like this is only ever going to give a rough answer. What you really want is a test set where each input word is paired with its expected output word, so you can do error analysis (seeing exactly which words fail, and how) and regression testing (checking that a change doesn't break words that used to work).
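A paired test set of that kind could be used roughly like this. This is only a sketch: I don't know what your scripts actually do, so the pluralizing functions and the three-word test set below are invented stand-ins, and `evaluate` and `regressions` are hypothetical helper names.

```python
from typing import Callable, Dict, List, Tuple

# One failure record: (input word, expected output, actual output).
Failure = Tuple[str, str, str]


def evaluate(script: Callable[[str], str],
             test_set: Dict[str, str]) -> Tuple[float, List[Failure]]:
    """Run `script` on every input word and collect the mismatches."""
    failures = []
    for word, expected in test_set.items():
        actual = script(word)
        if actual != expected:
            failures.append((word, expected, actual))
    correct = len(test_set) - len(failures)
    accuracy = correct / len(test_set) if test_set else 0.0
    return accuracy, failures


def regressions(old: List[Failure], new: List[Failure]) -> List[Failure]:
    """Failures of the new script on words the old script got right."""
    old_failed = {word for word, _, _ in old}
    return [f for f in new if f[0] not in old_failed]


# Invented stand-ins for the two scripts being compared.
def script_v1(word: str) -> str:
    return word + "s"


def script_v2(word: str) -> str:
    return word + ("es" if word.endswith("x") else "s")


# Invented test set: input word -> expected output word.
test_set = {"cat": "cats", "box": "boxes", "city": "cities"}

acc1, fails1 = evaluate(script_v1, test_set)
acc2, fails2 = evaluate(script_v2, test_set)
print(f"v1 accuracy: {acc1:.2f}, failures: {fails1}")
print(f"v2 accuracy: {acc2:.2f}, failures: {fails2}")
print("regressions in v2:", regressions(fails1, fails2))
```

The failure lists are what make error analysis possible: you can see which words each script gets wrong and what it produced instead, rather than just a total. The `regressions` check tells you whether the second script broke anything the first one handled correctly.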