After GIZA++ has finished iterating, I normally use the phrase tables for various purposes. I am currently working on a small translation project, and I'd like to retrieve parallel sentences for translations with low probabilities. So I look up translation equivalents in the phrase table that have a low probability but are still valid. My aim is to retrieve the source-language sentences that contain the source phrase, together with the corresponding target-language sentences that contain the lower-probability translation equivalents.
As an example, consider the Arabic preposition 'fii', which normally translates to 'in', then to NULL, then to 'on', 'at', 'by', 'into' and more (trained on the UN corpus). How would you retrieve lines where the source sentence includes 'fii' and the target sentence has NULL or 'at' but not 'in'? Since 'in' is very common in general, searching for lines which *do not* include 'in' but do include 'at' or NULL yields results with low recall. It is also not necessarily precise, since these low-probability equivalents can just as well be translations of something else (and they normally are).
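To make the naive surface filter described above concrete, here is a minimal sketch in Python. The function name, the toy sentence pairs and the placeholder source tokens are all invented for illustration; only 'fii', 'in' and 'at' come from the example. NULL cannot be matched this way at all, since it leaves no token in the target sentence.

```python
def naive_filter(pairs, src_word, wanted, unwanted):
    """Yield (source, target) pairs using surface token matching only.

    Hypothetical helper: keeps a pair when the source contains src_word,
    the target contains at least one token from `wanted`, and the target
    does not contain `unwanted`. No alignment information is used.
    """
    for src, tgt in pairs:
        src_tokens = src.split()
        tgt_tokens = set(tgt.lower().split())
        if src_word not in src_tokens:
            continue  # source phrase absent
        if unwanted in tgt_tokens:
            continue  # dominant translation present somewhere in the target
        if tgt_tokens & wanted:
            yield src, tgt

# Toy data, invented for illustration; source tokens other than 'fii'
# are placeholders, not real Arabic.
pairs = [
    ("a fii b", "they met in geneva"),
    ("c fii d", "they met at noon"),
    ("e fii f", "at that time he was in paris"),
    ("g h", "look at this"),
    ("i fii j", "he looked at the report during the session"),
]

hits = list(naive_filter(pairs, "fii", {"at"}, "in"))
for src, tgt in hits:
    print(src, "|||", tgt)
```

On this toy data the third pair is dropped merely because 'in' occurs elsewhere in the target (the recall problem), while the fifth pair is kept even though its 'at' may well translate something other than 'fii' (the precision problem).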
I was wondering: (a) whether any of the output files contains information about the most recent probability update and which sentences these updates were taken from; (b) whether this method of retrieving the desired sentences is viable; and (c) whether there are better ways to go about it.
Thanks a lot, Noam