DATA MINING
Desktop Survival Guide by Graham Williams
The generalisation error rate from random forests tends to compare favourably with that of boosting approaches, while random forests are more robust to noise in the training dataset. This makes the random forest a very stable model builder, free of the sensitivity to noise that single decision tree induction suffers from. The general observation is that the random forest model builder is very competitive with nonlinear classifiers such as artificial neural networks and support vector machines. However, performance is often dataset dependent, so it remains useful to try a suite of approaches.
Each decision tree is built from a random sample of the training dataset, drawn with replacement (this is known as bagging, or bootstrap aggregating). That is, some entities will be included more than once in the sample, while others won't appear at all. Generally, about two thirds of the distinct entities of the training dataset will appear in the sample used for a given tree, and one third will be left out.
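We can check this two-thirds observation with a quick simulation in R, drawing one bootstrap sample of an arbitrarily chosen 10,000 row indices (the proportion of distinct entities converges to about 1 - 1/e, or roughly 0.63):

    # Draw a bootstrap sample (sampling with replacement) of the row
    # indices of a hypothetical training dataset of 10,000 entities.
    n <- 10000
    bootstrap <- sample(n, size = n, replace = TRUE)

    # Proportion of distinct entities appearing in the sample:
    # approximately 1 - 1/e, i.e., roughly two thirds (about 0.632).
    length(unique(bootstrap)) / n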
In building each decision tree model from its own random sample of the training dataset, a random subset of the available variables is used to decide how best to partition the dataset at each node. Each decision tree is then built to its maximum size, with no pruning performed.
Together, the resulting decision trees of the forest constitute the final ensemble model: to classify a new entity, each decision tree votes for a result and the majority wins. (For a regression model the result is the average of the values predicted by the individual regression trees.)
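The combination step can be sketched directly. The following toy R fragment uses made-up tree predictions (five entities, one hundred trees) purely to illustrate majority voting for classification and averaging for regression:

    # Classification: one row per entity, one column per tree.
    votes <- matrix(sample(c("yes", "no"), 5 * 100, replace = TRUE), nrow = 5)
    # Each tree votes and the majority class wins for each entity.
    majority <- apply(votes, 1, function(v) names(which.max(table(v))))

    # Regression: the ensemble prediction is the average over the trees.
    preds <- matrix(rnorm(5 * 100), nrow = 5)
    average <- rowMeans(preds)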
In building the random forest model we have options to choose the number of trees to build, the size of the training dataset sample used for building each decision tree, and the number of variables to randomly select when considering how to partition the dataset at each node. The random forest model builder can also report on which input variables are most important in determining the values of the output variable.
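As an illustrative sketch rather than a prescription, these options correspond to the ntree, sampsize, and mtry arguments of the randomForest package in R, with importance=TRUE requesting the variable importance report; the parameter values and the iris dataset here are chosen only for illustration:

    library(randomForest)

    model <- randomForest(Species ~ ., data = iris,
                          ntree      = 500,   # number of trees to build
                          sampsize   = 100,   # sample size used for each tree
                          mtry       = 2,     # variables tried at each split
                          importance = TRUE)  # record variable importance

    importance(model)    # report the importance of each input variable
    varImpPlot(model)    # and plot it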
By building each decision tree to its maximal depth (i.e., by not pruning the decision tree), we end up with a model that is less biased.
The randomness introduced by the random forest model builder in the dataset selection and in the variable selection delivers considerable robustness to noise, outliers, and over-fitting, when compared to a single tree classifier.
The randomness also delivers substantial computational efficiencies. In building each individual decision tree the model builder works with only a random sample of the training dataset. Also, at each node in the process of building the decision tree, only a small fraction of the available variables is considered when determining how best to partition the dataset. This substantially reduces the computational requirement.
In summary, a random forest model is a good choice for model building for a number of reasons. First, just like decision trees, very little, if any, pre-processing of the data needs to be performed: the data does not need to be normalised and the approach is resilient to outliers. Second, if we have many input variables, we generally do not need to do any variable selection before we begin model building, as the random forest model builder is able to target the most useful variables. Third, because many trees are built with two levels of randomness and each tree is effectively an independent model, the model builder tends not to overfit to the training dataset.