DATA MINING
Desktop Survival Guide by Graham Williams

Model Tuning
What is the right value to use for each of the tuning variables of the model building algorithms that we use in data mining? The variable settings can make the difference between a good and a poor model.
The caret package, as well as providing a unified interface to many of the model builders we have covered in this book, provides a parameter tuning approach. Here is an example:
> library(rattle)
> library(caret)
> data(audit)
> mysample <- sample(nrow(audit), 1400)
> myrpart <- train(audit[mysample, c(2,4:5,7:10)],
                   as.factor(audit[mysample, c(13)]),
                   "rpart")
Model 1: maxdepth=6
collapsing over other values of maxdepth
> myrpart

Call:
train.default(x = audit[mysample, c(2, 4:5, 7:10)],
    y = as.factor(audit[mysample, c(13)]), method = "rpart")

1400 samples, 7 predictors
largest class: 77.71% (0)

summary of bootstrap (25 reps) sample sizes:
    1400, 1400, 1400, 1400, 1400, 1400, ...

boot resampled training results across tuning parameters:

  maxdepth  Accuracy  Kappa  Accuracy SD  Kappa SD  Optimal
  2         0.817     0.423  0.0142       0.0386
  3         0.818     0.413  0.0171       0.0617    *
  6         0.814     0.412  0.019        0.0488

Accuracy was used to select the optimal model

> myrpart$finalModel
n= 1400

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1400 312 0 (0.77714286 0.22285714)
  2) Marital=Absent,Divorced,Married-spouse-absent,Unmarried,Widowed 773 38 0 (0.95084088 0.04915912) *
  3) Marital=Married 627 274 0 (0.56299841 0.43700159)
    6) Education=College,HSgrad,Preschool,Vocational,Yr10,Yr11,Yr12,Yr1t4,Yr5t6,Yr7t8,Yr9 409 129 0 (0.68459658 0.31540342)
     12) Deductions< 1708 400 120 0 (0.70000000 0.30000000) *
     13) Deductions>=1708 9 0 1 (0.00000000 1.00000000) *
    7) Education=Associate,Bachelor,Doctorate,Master,Professional 218 73 1 (0.33486239 0.66513761) *
Similarly, we can replace rpart with rf to tune a random forest instead.
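The substitution is straightforward; a minimal sketch, assuming the same audit sample and column indices as above (with method "rf", caret tunes mtry, the number of variables tried at each split, rather than maxdepth):

```r
library(rattle)
library(caret)

data(audit)
mysample <- sample(nrow(audit), 1400)

# Same predictors and target as the rpart example, but tuning a
# random forest; caret resamples over candidate mtry values.
myrf <- train(audit[mysample, c(2,4:5,7:10)],
              as.factor(audit[mysample, c(13)]),
              "rf")
myrf              # resampled accuracy across mtry values
myrf$finalModel   # the random forest refit with the chosen mtry
```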
The tune function from the e1071 package provides a simple, if sometimes computationally expensive, approach to finding good values for a collection of tuning variables. We explore the use of this function here.
The tune function provides a number of global tuning variables that affect how the tuning happens. The nrepeat variable (number of repeats) specifies how often the training should be repeated. The repeat.aggregate variable identifies a function that specifies how to combine the training results over the repeated training. The sampling variable identifies the sampling scheme to use, allowing for cross-validation, bootstrapping, or a simple train/test split. For each type of sample, further variables are supplied, including, for example, cross = 10 to set the cross-validation to be 10-fold. The sampling.aggregate variable specifies a function to combine the training results over the various training samples. A good default (provided by tune) is to train once with 10-fold cross-validation.
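These control variables are collected through tune.control. A sketch of how they fit together, using an svm over the iris data purely as an illustration (the dataset and parameter grid are assumptions, not from the text above):

```r
library(e1071)

data(iris)

# Tune an svm over a small grid of gamma and cost values.  The
# tune.control() call makes the defaults explicit: a single
# repetition, 10-fold cross-validation, and mean used to aggregate
# results across folds and repeats.
mytune <- tune(svm, Species ~ ., data = iris,
               ranges = list(gamma = 2^(-2:0), cost = 2^(0:2)),
               tunecontrol = tune.control(nrepeat = 1,
                                          repeat.aggregate = mean,
                                          sampling = "cross",
                                          cross = 10,
                                          sampling.aggregate = mean))

summary(mytune)         # performance for each gamma/cost combination
mytune$best.parameters  # the combination with the lowest error
```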