The issue is handled in Weka via cost-sensitive classification (see e.g. http://wekadocs.com/node/15), which lets you give the learning algorithm a cost matrix expressing per-category penalties for misclassified examples. I am not sure whether a cost-sensitive classifier can be instantiated over any learning algorithm in Weka (decision trees in particular), but it is definitely worth checking.
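
To make the cost matrix idea concrete, here is a minimal sketch in plain Python (not Weka code). It assumes the classifier can output a probability for "Yes"; the cost values and the names COST, expected_cost and min_cost_label are made up for illustration.

```python
# Hypothetical cost matrix, indexed as COST[actual][predicted].
# The numbers are illustrative assumptions, not values from Weka.
COST = {
    "Yes": {"Yes": 0.0, "No": 1000.0},  # missing a real "Yes" is very expensive
    "No": {"Yes": 1.0, "No": 0.0},      # a false alarm is cheap
}


def expected_cost(p_yes, predicted):
    """Expected cost of predicting `predicted` when P(actual = "Yes") = p_yes."""
    return p_yes * COST["Yes"][predicted] + (1.0 - p_yes) * COST["No"][predicted]


def min_cost_label(p_yes):
    """Return the label with the lowest expected cost under COST."""
    return min(("Yes", "No"), key=lambda label: expected_cost(p_yes, label))
```

With these particular costs the rule answers "Yes" as soon as the estimated probability of "Yes" exceeds 1/1001, so an example with P(Yes) = 0.01 is kept even though "No" is far more likely.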
On 21/apr/09, at 16:33, Eddie Bell wrote:
> Hi Emmanuel,
> I recently had a similar unbalanced data-set (98% 'No') and used an
> SVM with prior weights. The prior weights force the model to account
> for the recessive category by penalizing the classification errors of
> the dominant category (i.e. making recessive class accuracy more
> SVMs aren't as interpretable as decision trees; if trees are required,
> I believe the 'rpart' R package supports weighting. I'm not familiar
> enough with Weka to guide you in that respect, but weights should help
> with your problem.
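
One common way to choose such weights, sketched below in plain Python, is to weight each class inversely to its frequency. This is an assumption for illustration, not necessarily the scheme used with the SVM above; balanced_class_weights and weighted_error are hypothetical helper names.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Weight of the misclassified examples divided by the total weight."""
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    total = sum(weights[t] for t in y_true)
    return wrong / total
```

On a set with 98 "No" and 2 "Yes", a model that always answers "No" has a plain error of 2% but a weighted error of 50%, so it no longer looks like a good solution to the learner.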
> - eddie
> 2009/4/21 Emmanuel Prochasson <emmanuel.prochasson at univ-nantes.fr>:
>> Dear all,
>> I would like to build a decision tree (or any other relevant supervised
>> model) on a set of data containing 0.1% "Yes" and 99.9% "No", with
>> several attributes (12 for now, but I have to tune that). I use Weka,
>> which is totally awesome.
>> My goal is to prune the search space for another application (i.e. remove,
>> say, 80% of the data that are very unlikely to be "Yes"), that's why I'm
>> trying to use a decision tree. Of course some algorithms return a 1-leaf
>> node tree tagged "No", with a 99.9% precision, which is pretty
>> but ensures I will always discard all of my search space rather than
>> prune it.
>> My problem is : is there a way (algorithm ? software ?) to build a
>> that will maximise recall (all "Yes" elements tagged "Yes" by the
>> algorithm). I don't really care about precision (it's OK if many "No"
>> elements are tagged "Yes" -- I can handle false positives).
>> In other words, is there a way to build a decision tree under the
>> constraint of 100% recall?
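
A possible way to approximate the 100% recall constraint, assuming the model can give each example a score for "Yes" (for instance the class proportion in its leaf), is to set the decision threshold at the lowest score of any known "Yes" and prune everything below it. The sketch below is plain Python with hypothetical names; the guarantee only holds on the data used to pick the threshold, not on unseen data.

```python
def full_recall_threshold(scores, labels, positive="Yes"):
    """Largest threshold t such that every positive example has score >= t."""
    positive_scores = [s for s, y in zip(scores, labels) if y == positive]
    if not positive_scores:
        raise ValueError("no positive examples to calibrate on")
    return min(positive_scores)


def pruned_fraction(scores, threshold):
    """Fraction of examples scoring below the threshold, i.e. discarded as "No"."""
    return sum(s < threshold for s in scores) / len(scores)
```

The pruned fraction then tells you how much of the search space is removed while keeping every known "Yes"; checking it on held-out data gives an idea of how many "Yes" would be lost in practice.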
>> I'm not sure I made myself clear, and I'm not sure there are
>> for my problem.
>> Corpora mailing list
>> Corpora at uib.no
> Edward J. L. Bell
> C28, Computing Department,
> Infolab 21, Lancaster University
> +44(0) 15245 10348