Imbalanced Classification
Model accuracy is not an appropriate measure of performance when the data has a very imbalanced distribution of outcomes. For example, if positive cases account for just 1% of all cases, as might be the situation in an insurance dataset recording cases of fraud or in medical diagnoses of rare but terminal diseases, then the most accurate, yet most useless, model is one that predicts no fraud or no disease in every case. It will be 99% accurate! In such situations, the usual goal of the model builder, to build the most accurate model, does not match the actual goal of the model building exercise.
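A brief sketch in R makes the point concrete. The class labels and the 1% proportion here are illustrative assumptions, not taken from any particular dataset:

  # Illustrative data: 990 negative cases and 10 positive cases.
  actual    <- factor(c(rep("no", 990), rep("yes", 10)))
  predicted <- factor(rep("no", 1000), levels=levels(actual))

  # Overall accuracy of the do-nothing model looks impressive: 0.99.
  mean(predicted == actual)

  # The confusion matrix shows no positive case is ever identified.
  table(actual, predicted)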
There are two common approaches to dealing with class imbalance: sampling and cost-sensitive learning. A brief sketch of each is given below.
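The following sketch illustrates both ideas. The synthetic data frame ds, its target class, and the particular loss values are assumptions chosen for illustration, not a prescription:

  library(rpart)

  # Illustrative imbalanced data: 990 "no" cases and 10 "yes" cases.
  set.seed(42)
  ds <- data.frame(x1=rnorm(1000), x2=rnorm(1000),
                   class=factor(c(rep("no", 990), rep("yes", 10))))

  # Sampling: down-sample the majority class so the training data
  # contains equal numbers of "no" and "yes" cases.
  yes <- which(ds$class == "yes")
  no  <- sample(which(ds$class == "no"), length(yes))
  balanced <- ds[c(yes, no), ]

  # Cost-sensitive learning: rpart accepts a loss matrix. Here a
  # missed "yes" (a false negative) costs ten times a false alarm.
  model <- rpart(class ~ ., data=ds,
                 parms=list(loss=matrix(c(0, 10, 1, 0), nrow=2)))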
Before describing these two approaches in detail, it is worth noting that some algorithms have little difficulty building models from training data with imbalanced classes. Random forests, for example, need no such treatment of the training data to build models that capture under-represented classes quite well.
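As a sketch, reusing the illustrative ds from above, a random forest can be built directly on the imbalanced data, and its confusion matrix reports the error rate for each class separately, so performance on the rare class can be checked rather than assumed:

  library(randomForest)

  # Build the forest directly on the imbalanced data; no resampling
  # or cost matrix is applied.
  rf <- randomForest(class ~ ., data=ds)

  # The confusion matrix includes a class.error column, one entry
  # per class, making performance on the rare "yes" class visible.
  rf$confusion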