DATA MINING
Desktop Survival Guide by Graham Williams

Training and Test Datasets
In modelling we commonly build a model on a training set and then
evaluate its performance on a separate test set. The simplest way to
partition a dataset into training and test sets is with the sample
function, here placing a random 80% of the rows in the training set
and the remaining rows in the test set:
> sub <- sample(nrow(iris), floor(nrow(iris) * 0.8))
> iris.train <- iris[sub, ]
> iris.test <- iris[-sub, ]
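Because sample draws at random, the split above differs from run to run. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for your analysis; the seed value 42 and the name train.idx are illustrative choices, not part of the original example.

```r
# Fix the random seed so the same partition is produced on every run.
set.seed(42)  # the seed value itself is arbitrary (an assumption)

n <- nrow(iris)
train.idx <- sample(n, floor(n * 0.8))  # row numbers for the training set

iris.train <- iris[train.idx, ]
iris.test  <- iris[-train.idx, ]        # every row not chosen for training

# Sanity checks: the two sets are disjoint and together cover the data.
stopifnot(nrow(iris.train) + nrow(iris.test) == n)
stopifnot(length(intersect(rownames(iris.train), rownames(iris.test))) == 0)
```

With the 150-row iris dataset this gives 120 training rows and 30 test rows. Note that this simple split does not guarantee that each species appears in the same proportion in both sets.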
The sample.split function of the caTools package also comes in handy
here. Given a vector of labels, it returns a logical vector of the same
length that, by default (SplitRatio = 2/3), marks two thirds of the
elements as TRUE and one third as FALSE, while maintaining the relative
ratio of the different categoric values represented in the vector:
> mask <- sample.split(iris$Species)
> mask
[1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
[...]
[145] TRUE TRUE TRUE TRUE FALSE TRUE
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
> table(iris$Species[mask])
    setosa versicolor  virginica
        33         33         33
> table(iris$Species[!mask])
    setosa versicolor  virginica
        17         17         17
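The logical mask can also be used to index the rows of the data frame itself, which yields the training and test datasets directly. The sketch below assumes the caTools package is installed; the seed value and the names iris.train and iris.test are illustrative, and SplitRatio is written out only to make the default explicit.

```r
library(caTools)  # assumes caTools is installed: install.packages("caTools")

set.seed(42)  # arbitrary seed, used only to make the split repeatable

# TRUE marks rows for the training set, FALSE rows for the test set.
mask <- sample.split(iris$Species, SplitRatio = 2/3)

iris.train <- iris[mask, ]
iris.test  <- iris[!mask, ]

# Each species keeps the same share in both subsets.
table(iris.train$Species)
table(iris.test$Species)
```

As in the tables above, each species should contribute 33 rows to the training set and 17 to the test set.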