On Mon, Dec 3, 2012 at 6:54 AM, Detmar Meurers <dm at sfs.uni-tuebingen.de> wrote:
> An option you could consider is to sample an equal number of both, so
> that you get a random baseline of 50%.
> Given the low frequency of class E, you could just take all
> instances of class E and then randomly select the same number of
> instances of class S.
> Then you can report 10-fold cross validation results for this balanced
> data set.
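
The sampling step described above could be sketched roughly as follows. This is only a minimal illustration under my own assumptions: the function names `balanced_sample` and `k_folds` are hypothetical, the label list is a toy stand-in with the class counts from this thread, and each word is treated as an independent instance. The folds here are shuffled but not stratified.

```python
import random

def balanced_sample(labels, minority="E", majority="S", seed=0):
    """Return sorted indices of all minority items plus an equal-size
    random sample of majority items (hypothetical helper)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    # Undersample the majority class down to the minority class size.
    sampled = rng.sample(majority_idx, len(minority_idx))
    return sorted(minority_idx + sampled)

def k_folds(indices, k=10, seed=0):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Toy labels with the proportions from the thread (72,000 S, 8,000 E).
labels = ["S"] * 72000 + ["E"] * 8000
balanced = balanced_sample(labels)
folds = k_folds(balanced, k=10)

# Each fold serves once as the held-out set; the other nine are training data.
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
```

With these counts the balanced set holds 16,000 instances, so each of the ten folds holds 1,600.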
> While this is useful for getting a good grip on the performance of your
> features and classifier setup, if you want to test performance for a
> real-world application, you'll want to take into account that one class
> is much more frequent in the data that the application has to deal
> with. Depending on what the application is supposed to do, you'd then
> maximize precision or recall for the class you're most interested in.
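
To make the point about the real-world class distribution concrete, here is a small worked sketch. The classifier figures (80% recall on E, 80% specificity) are invented purely for illustration, and `precision_at_prevalence` is a hypothetical helper name. It only applies the standard relation between precision, recall, specificity, and class prevalence.

```python
def precision_at_prevalence(recall, specificity, prevalence):
    """Precision for the positive class, given its recall, the specificity,
    and the share of positive instances in the data."""
    true_pos = recall * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: 80% recall on E and 80% specificity.
p_balanced = precision_at_prevalence(0.8, 0.8, 0.5)  # balanced sample
p_natural = precision_at_prevalence(0.8, 0.8, 0.1)   # 8,000 E in 80,000 words

print(f"precision for E, balanced data: {p_balanced:.2f}")
print(f"precision for E, natural data:  {p_natural:.2f}")
```

Under these assumed figures, precision for E falls from 0.80 on the balanced sample to about 0.31 at the natural 10% rate, even though recall is unchanged.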
> Prof. Dr. Detmar Meurers, Universität Tübingen http://purl.org/dm
> Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany
> On Sun, Dec 02, 2012 at 05:13:55PM -0500, Emad Mohamed wrote:
> > Hello Corpora members,
> > I have a corpus of 80,000 words in which each word is assigned either the
> > class S or the class E. Class S occurs 72,000 times while class E occurs
> > 8,000 times only.
> > I'm wondering what the best way to evaluate the classifier performance
> > should be. I have randomly selected a dev set (5%) and a test set (10%).
> > I'm mainly interested in predicting which words are class E.
> > I've read this page:
> > webdocs.cs.ualberta.ca/~eisner/measures.html
> > but I'm still a little bit confused. Do we use specificity in linguistics
> > papers? Should I report these measures for each of the two classes or
> > a general number? Does this make sense / a difference?
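
On the per-class question, a minimal sketch of computing the measures separately for each class may help. The gold and predicted labels below are a made-up toy example, and `per_class_metrics` is a hypothetical helper, not something from the page cited above. One thing it shows is that, with only two classes, specificity for E is the same number as recall for S.

```python
def per_class_metrics(gold, predicted, positive):
    """Precision, recall, specificity and F1, treating `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Toy example: 3 gold E and 7 gold S, with one error in each direction.
gold = ["E", "E", "E", "S", "S", "S", "S", "S", "S", "S"]
pred = ["E", "E", "S", "E", "S", "S", "S", "S", "S", "S"]

metrics_e = per_class_metrics(gold, pred, positive="E")
metrics_s = per_class_metrics(gold, pred, positive="S")
for name, metrics in (("E", metrics_e), ("S", metrics_s)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

So reporting precision and recall for both classes already covers specificity in the two-class case.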
> > Thank you so much.
> > --
> > Emad Mohamed
> > aka Emad Nawfal
> > Université du Québec à Montréal
--
Emad Mohamed
aka Emad Nawfal
Université du Québec à Montréal