[Corpora-List] State of the Art: historical POS-Tagging EN?

Koos Wilt kooswilt at gmail.com
Fri Jun 2 18:34:23 CEST 2017

OK, Hermann, thank you for your interest, and here goes. I am cross-posting this to the list because I believe, given for instance Dr Emily Bender's last post, that there is an interest in the use of linguistic techniques for Text Analytics.

This link has code behind it that classifies text files from OHSUMED, a data set for experimenting with PubMed/MEDLINE, as belonging to the category HIV or TUBERCULOSIS. I recommend you run it. Don't forget to load the sets from OHSUMED. The program classify.py classifies 343 of the 400 documents correctly.
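For readers who want to see how a success rate such as 343/400 is counted, here is a minimal sketch. It is not taken from classify.py; the function name success_rate and the toy labels are made up for illustration.

```python
def success_rate(predicted, gold):
    """Return (correct, total) for two equally long label sequences."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct, len(gold)


# Hypothetical toy labels, NOT taken from OHSUMED.
gold = ["HIV", "HIV", "TUBERCULOSIS", "TUBERCULOSIS"]
predicted = ["HIV", "TUBERCULOSIS", "TUBERCULOSIS", "TUBERCULOSIS"]

correct, total = success_rate(predicted, gold)
print("%d/%d" % (correct, total))

# The success rate quoted above, expressed as a percentage.
baseline_pct = 100.0 * 343 / 400
print(baseline_pct)
```

On these figures, 343/400 corresponds to 85.75% of the documents.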


The second link to GitHub has a program, 2classify.py, that uses an ensemble of linguistic techniques to measure their influence on classification. The happy result of adding SUBJECT PREDICATE OBJECT triples, PoS tags, and WordNet expansion of the triples is that correct classification increases to 377/400. Some part of this is due to using hypernyms from WordNet. It is my primitive implementation of this


and this


As you will see, program 2classify.py has code for PoS classification. The following needs to be stated at the beginning of your code to enable PoS tagging:

import nltk
nltk.download()
from nltk.corpus import wordnet as wn
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In general, please consider that my code has not been peer-reviewed, was written in haste as a proof of concept, and has some stretches in main that should have been in subroutines. There is no documentation. Let me know whether I answered your question.

This code is self-explanatory regarding creating the tags with each word:

# PoS tags whose words are kept for lemmatization
KEPT_TAGS = {"JJ", "NN", "VB", "VBD", "RB", "NP", "NNS", "BER", "BEZ",
             "MD", "PRP", "VBG", "VBN", "VBP", "RBR", "RBS", "JJR", "JJS",
             "PDT", "DT", "IN", "PRP$", "VBZ", "WRB", "TO", "WP"}

for file in full_file_paths:
    f = os.path.basename(file)
    note = open(file, "r")
    text = note.read()
    #print f
    H = HashTable()
    H[count] = text
    #print count
    #print H[count]
    if count < 400:
        count += 1
    # strip punctuation from the raw text before tokenizing
    for c in string.punctuation:
        text = text.replace(c, "")
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    # each tagged token is a (word, tag) pair
    for tagged_token in tagged_tokens:
        word = tagged_token[0]
        for c in string.punctuation:
            word = word.replace(c, "")
        word_pos = tagged_token[1]
        if word_pos in KEPT_TAGS:
            lemma = wordnet_lemmatizer.lemmatize(word)
            #print lemma
            textLemmas = textLemmas + ' ' + lemma
    # wrap the collected lemmas in quotes and end with ' .'
    textLemmas = '\'' + textLemmas + ' .\'\n'
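The tag-filtering step can also be tried in isolation, without the rest of the program. What follows is a minimal sketch, not the code from 2classify.py: it takes (word, tag) pairs of the kind pos_tag returns, so it runs without NLTK or its data. The names filter_by_pos, strip_punctuation and KEEP_TAGS are invented for the example, the tag set is a small illustrative subset, and the optional lemmatize argument stands in for a lemmatizer. Unlike the code above, it also skips tokens that are empty after punctuation is removed.

```python
import string

# Hypothetical subset of Penn Treebank tags, chosen for illustration only.
KEEP_TAGS = {"JJ", "NN", "NNS", "VB", "VBD", "RB"}


def strip_punctuation(word):
    """Remove every ASCII punctuation character from word."""
    return "".join(ch for ch in word if ch not in string.punctuation)


def filter_by_pos(tagged_tokens, keep_tags=KEEP_TAGS, lemmatize=None):
    """Join the words whose tag is in keep_tags into one string.

    tagged_tokens is a list of (word, tag) pairs. lemmatize, if given,
    is any function mapping a word to its lemma (an assumption here).
    """
    kept = []
    for word, tag in tagged_tokens:
        word = strip_punctuation(word)
        if word and tag in keep_tags:
            kept.append(lemmatize(word) if lemmatize else word)
    return " ".join(kept)


# Hand-tagged toy sentence, not output from a real tagger run.
tagged = [("The", "DT"), ("patients", "NNS"), ("received", "VBD"),
          ("therapy", "NN"), (".", ".")]
result = filter_by_pos(tagged)
print(result)
```

In real use the pairs would come from pos_tag(word_tokenize(text)), and wordnet_lemmatizer.lemmatize could be passed as the lemmatize argument.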

2017-06-02 8:44 GMT+02:00 Herrmann, Berenike <jb.herrmann at phil.uni-goettingen.de>:

> Of course, Koos, I'd like to see it!
> Very best,
> Berenike
> ------------------------------
> *From:* Koos Wilt [kooswilt at gmail.com]
> *Sent:* Thursday, 1 June 2017 18:30
> *To:* Herrmann, Berenike
> *Cc:* corpora at uib.no
> *Subject:* Re: [Corpora-List] State of the Art: historical POS-Tagging EN?
> I used POS tagging in Python about two months ago for a study (of
> something else, but POS tagging was part of a 'linguistics techniques'
> ensemble). I have already forgotten exactly what it is I did, and whether
> it was part of NLTK proper or whether I got creative. I can drag it up and
> send it to you if you deem this useful.
> Best regards,
> -Koos
> 2017-06-01 16:33 GMT+02:00 Herrmann, Berenike <jb.herrmann at phil.uni-goettingen.de>:
>> Dear all,
>> We are preparing a project on lexico-semantic analyses of 18th/19th
>> Century __English-written__ texts from different written genres: __essays,
>> literary texts, also letters and diaries__. It's (mainly) British English.
>> I'd like to know the state of the art:
>> - What out-of-the-box taggers (TreeTagger, Perceptron, TnT, Stanford,
>> CLAWS, etc.) perform best on this type of data?
>> - What tagger types are possibly best suited? (HMM, maximum entropy, CRF,
>> etc.)
>> - Are there any historical/genre-specific language models available?
>> - How about tokenizers/orthographic normalization: Is either an issue for
>> British English of that period?
>> Any kind of pointer and/or assessment is welcome.
>> Many great thanks!!!
>> Very best,
>> Berenike
>> https://jberenike.github.io/