Note that NLTK just uses Punkt here, and an off-the-shelf Punkt model won't necessarily perform well if it was trained on data whose abbreviations differ from those in the test data:
"Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text." http://nltk.org/api/nltk.tokenize.html
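As the quote suggests, you can train Punkt on your own in-domain text by passing it to the constructor. A minimal sketch (the sample `text` here is illustrative, not from the original answer):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# In practice this would be a large corpus from your target domain,
# so Punkt can learn its abbreviations, collocations, etc.
text = (
    "Dr. Smith visited the lab on Jan. 5. "
    "He met Prof. Jones there. They discussed the results."
)

# Learn parameters from the given text instead of using a pre-packaged model.
tokenizer = PunktSentenceTokenizer(text)
for sentence in tokenizer.tokenize(text):
    print(sentence)
```

Note that Punkt's unsupervised learning needs a reasonably large corpus to pick up abbreviations reliably; on a few sentences like the toy example above it may still split after "Dr." or "Jan.".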
-Steven Bird