I recently worked with a similarly unbalanced data set (98% 'No') and used an SVM with class weights (prior weights). The weights force the model to account for the minority class by penalizing its misclassifications more heavily, i.e. making accuracy on the minority class more important.
SVMs aren't as interpretable as decision trees; if trees are required, I believe the 'rpart' R package supports weighting. I'm not familiar enough with Weka to guide you there, but weights should help with your problem.
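To make the weighting idea concrete, here is a minimal sketch using Python and scikit-learn (my assumption, since you mentioned Weka -- the same idea carries over via Weka's cost-sensitive options or rpart's weights). The synthetic data, the LinearSVC choice, and the 100:1 weight ratio are all illustrative, not your actual setup:

```python
# A minimal sketch of class weighting on imbalanced data (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the imbalanced data: roughly 1% "Yes" (label 1).
X, y = make_classification(n_samples=5000, n_features=12,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unweighted SVM: an error on the rare class costs the same as any other,
# so the model can afford to ignore the "Yes" examples.
plain = LinearSVC(random_state=0).fit(X_tr, y_tr)
plain_recall = recall_score(y_te, plain.predict(X_te))

# Weighted SVM: a misclassified "Yes" costs 100x more, so the decision
# boundary shifts toward catching the rare class.
weighted = LinearSVC(class_weight={0: 1, 1: 100},
                     random_state=0).fit(X_tr, y_tr)
weighted_recall = recall_score(y_te, weighted.predict(X_te))

print(f"recall without weights: {plain_recall:.2f}")
print(f"recall with weights:    {weighted_recall:.2f}")
```

The larger you make the minority-class weight, the more precision you trade for recall -- which suits your goal, since you said false positives are acceptable as long as the "Yes" examples survive the pruning.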
2009/4/21 Emmanuel Prochasson <emmanuel.prochasson at univ-nantes.fr>:
> Dear all,
> I would like to build a decision tree (or whatever supervised classifier
> relevant) on a set of data containing 0.1% "Yes" and 99.9% "No", using
> several attributes (12 for now, but I have to tune that). I use Weka,
> which is totally awesome.
> My goal is to prune search space for another application (ie : remove
> say, 80% of the data that are very unlikely to be "Yes"), that's why I'm
> trying to use a decision tree. Of course, some algorithms return a
> one-leaf tree tagged "No", which is 99.9% accurate but ensures I will
> always discard my entire search space rather than prune it.
> My problem is: is there a way (an algorithm? software?) to build a tree
> that maximises recall (all "Yes" elements tagged "Yes" by the
> algorithm)? I don't really care about precision (it's fine if many "No"
> elements are tagged "Yes" -- I can handle false positives).
> In other words, is there a way to build a decision tree under the
> constraint of 100% recall?
> I'm not sure I made myself clear, and I'm not sure there are solutions
> for my problem.
> Corpora mailing list
> Corpora at uib.no
-- Edward J. L. Bell C28, Computing Department, Infolab 21, Lancaster University
+44(0) 15245 10348