Imbalanced Classification
Model accuracy is not an appropriate measure of performance when the data has a very imbalanced distribution of outcomes. For example, if positive cases account for just 1% of all cases, as might be the situation in an insurance dataset recording cases of fraud or in medical diagnoses of rare but terminal diseases, then the most accurate, yet most useless, model is one that predicts no fraud or no disease in every case. It will be 99% accurate! In such situations, the usual goal of the model builder, to build the most accurate model, does not match the actual goal of the model building exercise.
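A brief sketch in R makes the point concrete. The class labels and the 1% proportion here are illustrative assumptions, not taken from any particular dataset:

  # Illustrative data: 990 negative cases and 10 positive cases.
  actual    <- factor(c(rep("no", 990), rep("yes", 10)))
  predicted <- factor(rep("no", 1000), levels=levels(actual))

  # Overall accuracy of the do-nothing model looks impressive: 0.99.
  mean(predicted == actual)

  # The confusion matrix shows no positive case is ever identified.
  table(actual, predicted)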
There are two common approaches to dealing with class imbalance: sampling and cost-sensitive learning. A brief sketch of each is given below.
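The following sketch illustrates both ideas. The synthetic data frame ds, its target class, and the particular loss values are assumptions chosen for illustration, not a prescription:

  library(rpart)

  # Illustrative imbalanced data: 990 "no" cases and 10 "yes" cases.
  set.seed(42)
  ds <- data.frame(x1=rnorm(1000), x2=rnorm(1000),
                   class=factor(c(rep("no", 990), rep("yes", 10))))

  # Sampling: down-sample the majority class so the training data
  # contains equal numbers of "no" and "yes" cases.
  yes <- which(ds$class == "yes")
  no  <- sample(which(ds$class == "no"), length(yes))
  balanced <- ds[c(yes, no), ]

  # Cost-sensitive learning: rpart accepts a loss matrix. Here a
  # missed "yes" (a false negative) costs ten times a false alarm.
  model <- rpart(class ~ ., data=ds,
                 parms=list(loss=matrix(c(0, 10, 1, 0), nrow=2)))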
Before describing these two approaches in detail, it is worth noting that some algorithms have little difficulty building models from training data with imbalanced classes. Random forests, for example, need no such treatment of the training data to build models that capture under-represented classes quite well.
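As a sketch, reusing the illustrative ds from above, a random forest can be built directly on the imbalanced data, and its confusion matrix reports the error rate for each class separately, so performance on the rare class can be checked rather than assumed:

  library(randomForest)

  # Build the forest directly on the imbalanced data; no resampling
  # or cost matrix is applied.
  rf <- randomForest(class ~ ., data=ds)

  # The confusion matrix includes a class.error column, one entry
  # per class, making performance on the rare "yes" class visible.
  rf$confusion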