[Corpora-List] Decision tree : maximise recall over precision

Stefanie Tellex stefie10
Tue Apr 21 17:54:41 CEST 2009

I too would be interested to hear what you end up doing, since I'm also using decision trees for information retrieval.

I was thinking about the following hack: duplicate the "yes" training examples N times. This will prevent the "yes" nodes in the tree from getting pruned because they contain too few examples compared to the "no" nodes. I was actually writing code to do that right now because I have a problem where I have trustworthy positive examples and less trustworthy negative examples.
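A minimal sketch of the duplication hack, assuming the training set is a pair of parallel Python lists. The function name `oversample_positives` and the `n_copies` parameter are made up for illustration; they are not part of Weka or any other toolkit.

```python
def oversample_positives(examples, labels, n_copies, positive_label="yes"):
    """Return new lists in which every positive example appears n_copies times.

    Negative examples are kept once each. The original order is preserved,
    so shuffle afterwards if the learner is sensitive to ordering.
    """
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    out_examples, out_labels = [], []
    for example, label in zip(examples, labels):
        repeat = n_copies if label == positive_label else 1
        out_examples.extend([example] * repeat)
        out_labels.extend([label] * repeat)
    return out_examples, out_labels


# Hypothetical toy data: one "yes" among three examples, duplicated 3 times.
train_x, train_y = oversample_positives(["a", "b", "c"], ["no", "yes", "no"], 3)
```

For learners that accept per-instance weights, giving each "yes" example a weight of N should have a similar effect without enlarging the data set.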

Another approach might be to try clustering, to turn it into a multi-class problem. If it turns out there are clusters that only contain negative examples that you can identify with high precision, then you can throw out examples classified into those clusters.
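A rough sketch of the filtering step, assuming some clustering algorithm has already assigned a cluster id to every training example. The helper names and the `min_size` threshold are assumptions for illustration; the clustering itself is not shown.

```python
from collections import Counter


def pure_negative_clusters(cluster_ids, labels, positive_label="yes", min_size=1):
    """Return the ids of clusters that contain no positive training example.

    min_size guards against trusting very small clusters, which may be
    negative-only purely by chance.
    """
    sizes = Counter(cluster_ids)
    has_positive = {c for c, y in zip(cluster_ids, labels) if y == positive_label}
    return {c for c, n in sizes.items() if n >= min_size and c not in has_positive}


def keep_mask(cluster_ids, discard):
    """True for examples to keep, False for those in a discarded cluster."""
    return [c not in discard for c in cluster_ids]


# Hypothetical toy data: cluster 1 contains the only "yes".
ids = [0, 0, 1, 1, 2]
ys = ["no", "no", "no", "yes", "no"]
discard = pure_negative_clusters(ids, ys, min_size=2)
mask = keep_mask(ids, discard)
```

Whether the negative-only clusters stay negative-only on unseen data would need to be checked on a held-out set before throwing anything away.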

Note that it sounds like you actually do not want to maximize recall - you could trivially do that by simply returning all results in the corpus. It might be more helpful to think about maximizing weighted F-score, where the weight is biased towards recall.
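To make the weighted F-score concrete, here is a small sketch of the standard F-beta measure, where beta > 1 weights recall more heavily than precision. The choice of beta = 2 as the default is only an example, not a recommendation from this thread.

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision, and beta == 1
    gives the usual F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Returning the whole corpus gives recall 1.0, but precision equals the
# base rate (0.1% here), so the score stays close to zero.
return_everything = f_beta(precision=0.001, recall=1.0)

# A hypothetical filter with modest precision and high recall scores far better.
useful_filter = f_beta(precision=0.2, recall=0.9)
```

Under this measure the trivial "return everything" solution is heavily penalised, even though its recall is perfect.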


Eric Atwell wrote:
> Emmanuel,
> Surely a good decision procedure is "JUST SAY NO!" - "only" 99.9% accurate!
> I wish PoS-taggers and other text annotation tools were as good!
> It sounds like you want to find out how to set a WEKA decision-tree
> builder to NOT prune any branches ... this question is better put to
> the WEKA mailing list wekalist at list.scms.waikato.ac.nz - see
> https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist to join
> Eric Atwell, Leeds University
> PS - please let me know if you find the answer - this looks like an
> interesting class coursework exercise!
> On Tue, 21 Apr 2009, Emmanuel Prochasson wrote:
>> Dear all,
>> I would like to build a decision tree (or any other relevant supervised
>> classifier) on a set of data containing 0.1% "Yes" and 99.9% "No", using
>> several attributes (12 for now, but I have to tune that). I use Weka,
>> which is totally awesome.
>> My goal is to prune the search space for another application (i.e. remove,
>> say, 80% of the data that are very unlikely to be "Yes"); that's why I'm
>> trying to use a decision tree. Of course, some algorithms return a one-leaf
>> tree tagged "No", with 99.9% accuracy, which is pretty accurate,
>> but ensures I will always discard all of my search space rather than
>> prune it.
>> My problem is: is there a way (an algorithm? software?) to build a tree
>> that will maximise recall (all "Yes" elements tagged "Yes" by the
>> algorithm)? I don't really care about precision (it's OK if many "No"
>> elements are tagged "Yes"; I can handle false positives).
>> In other words, is there a way to build a decision tree under the
>> constraint of 100% recall?
>> I'm not sure I made myself clear, and I'm not sure there are solutions
>> for my problem.
>> Regards,
