We have encountered similar problems with skewed data in machine learning tasks such as phrase break prediction for English and Arabic, where majority-class baseline accuracy is already high at around 79% and 85% respectively.
We have also encountered an extreme case of data skew on the EPSRC-funded "Making Sense" project, when classifying unseen texts using keywords and key bigrams as features, derived from a training set of examples associated with a given concept.
One of our recommendations is to compare results from different taggers. Another is to use a combination of performance metrics, because accuracy alone does not necessarily denote success at minority class recognition. We often compare accuracy against the balanced classification rate (the mean of per-class recall), which is easy to compute - see for example our LREC paper: "Predicting Phrase Breaks in Classical and Modern Standard Arabic Text" (Sawalha et al., 2012).
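To make the contrast concrete, here is a minimal sketch of the balanced classification rate in plain Python (the labels are illustrative, not taken from any of the datasets mentioned here). It shows why a degenerate majority-class tagger can score high accuracy while the BCR exposes its failure on the minority class:

```python
# Balanced classification rate (BCR): the mean of per-class recall,
# i.e. (sensitivity + specificity) / 2 for a binary task. Unlike raw
# accuracy, a majority-class-only classifier cannot exceed 0.5.

def balanced_classification_rate(gold, predicted):
    """Mean recall over the classes present in the gold labels."""
    classes = set(gold)
    recalls = []
    for c in classes:
        support = sum(1 for g in gold if g == c)
        hits = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(classes)

# A degenerate tagger that always predicts the majority class "S":
gold = ["S"] * 9 + ["E"]
pred = ["S"] * 10
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)                                   # 0.9 - looks good
print(balanced_classification_rate(gold, pred))   # 0.5 - reveals the problem
```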
Currently we are researching and implementing alternatives to supervised machine learning for Arabic phrase break prediction, since our dataset, though exemplary, is relatively small and finite.
Claire Brierley Computing University of Leeds, UK ________________________________________ From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Emad Mohamed [emohamed at umail.iu.edu] Sent: 03 December 2012 15:10 To: Detmar Meurers; corpora at uib.no Subject: Re: [Corpora-List] Question about evaluation
Thank you all for the valuable advice. I think I'll go with making one of them the positive class and the other the negative one, and measure precision, recall and the F-score. Thank you again.
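Treating the rarer class E as the positive class, precision, recall and the F-score can be computed directly from the confusion counts. A minimal sketch in plain Python (the toy labels are illustrative only):

```python
# Precision, recall and F1 with the minority class E as the positive
# class, computed from true positives, false positives and false negatives.

def precision_recall_f1(gold, predicted, positive):
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, not from the actual corpus:
gold = ["E", "E", "S", "S", "S"]
pred = ["E", "S", "E", "S", "S"]
print(precision_recall_f1(gold, pred, positive="E"))  # (0.5, 0.5, 0.5)
```

Reporting these per class (or at least for the class of interest, E) is usually more informative than a single pooled number on skewed data.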
On Mon, Dec 3, 2012 at 6:54 AM, Detmar Meurers <dm at sfs.uni-tuebingen.de<mailto:dm at sfs.uni-tuebingen.de>> wrote: Hi,
an option you could consider is to sample an equal number of both, so that you get a random baseline of 50%.
Given the low frequency of class E, you could just take all instances of class E and then randomly select the same number of instances of Class S.
Then you can report 10-fold cross validation results for this balanced data set.
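The balanced-sampling step above can be sketched as follows, using the 72,000 S / 8,000 E split from the original question; the `balanced_sample` helper is hypothetical, and in practice one would then hand the result to a stratified 10-fold cross-validation routine:

```python
import random

def balanced_sample(instances, labels, minority, seed=0):
    """Keep every minority-class instance and randomly downsample the
    majority class to the same size, giving a 50% random baseline."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    kept = minority_idx + rng.sample(majority_idx, len(minority_idx))
    rng.shuffle(kept)  # avoid a block of one class followed by the other
    return [instances[i] for i in kept], [labels[i] for i in kept]

# 72,000 "S" tokens and 8,000 "E" tokens -> 16,000 balanced tokens
labels = ["S"] * 72000 + ["E"] * 8000
instances = list(range(len(labels)))  # stand-ins for feature vectors
X, y = balanced_sample(instances, labels, minority="E")
print(len(y), y.count("E"), y.count("S"))  # 16000 8000 8000
```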
While this is useful for getting a good grip on the performance of your features and classifier setup, if you want to test performance for a real-world application, you'll want to take into account that one class is much more prominent in the data that the application will actually be dealing with. Depending on what the application is supposed to do, you'd then maximize precision or recall for the class you're most interested in.
-- Prof. Dr. Detmar Meurers, Universität Tübingen http://purl.org/dm Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany
On Sun, Dec 02, 2012 at 05:13:55PM -0500, Emad Mohamed wrote:
> Hello Corpora members,
> I have a corpus of 80,000 words in which each word is assigned either the
> class S or the class E. Class S occurs 72,000 times while class E occurs
> 8,000 times only.
> I'm wondering what the best way to evaluate the classifier performance
> should be. I have randomly selected a dev set (5%) and a test set (10%).
> I'm mainly interested in predicting which words are class E.
> I've read this page:
> but I'm still a little bit confused. Do we use specificity in linguistics
> papers? Should I report these measures for each of the two classes or as
> a general number? Does this make sense / make a difference?
> Thank you so much.
> Emad Mohamed
> aka Emad Nawfal
> Université du Québec à Montréal
-- Emad Mohamed aka Emad Nawfal Université du Québec à Montréal