DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
Sampling |
A traditional approach to solving class imbalance, and one that works well in many modelling situations, but not all, is to sample your data. Sampling aims to remove or at least redress the balance. It is a data preprocessing step whereby the algorithm used by the model builder does not generally need to be modified. Because of this, the approach is readily applicable (but not necessarily appropriate) to any model builder.
MENTION APPROACHES AND FOR EACH ILLUSTRATE HOW TO DO IT IN R.
Random undersampling will randomly choose a subset of the over represented class (or classes) to approach the same number as the underpresented class (or classes) for inclusion in the training dataset.
Random oversampling will randomly duplicate records from the under-represented class (or classes) for inclusion in the training dataset.
The synthetic minority oversampling technique (SMOTE).
Cluster-based oversampling.
One-sided selection.
Wilson's editing.