DATA MINING
Desktop Survival Guide by Graham Williams
Training and Test Datasets
Often in modelling we build our model on a training set and then test
its performance on a test set. The simplest way to partition your
dataset into training and test sets is with the sample function, here
placing a random 80% of the rows in the training set and the rest in
the test set:
> sub <- sample(nrow(iris), floor(nrow(iris) * 0.8))
> iris.train <- iris[sub, ]
> iris.test <- iris[-sub, ]
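Because sample draws at random, each run produces a different partition. A minimal sketch of a reproducible variant follows, assuming we fix the seed with set.seed (the value 42 is arbitrary and not from the original text) and add a few sanity checks on the result:

```r
# Fix the random seed so the same partition is produced on every run.
# The seed value 42 is an arbitrary choice for illustration.
set.seed(42)

n <- nrow(iris)
sub <- sample(n, floor(n * 0.8))   # row indices for the training set

iris.train <- iris[sub, ]
iris.test  <- iris[-sub, ]         # every row not selected above

# Sanity checks: the sizes add up and no row appears in both sets.
nrow(iris.train)                                              # 120
nrow(iris.test)                                               # 30
length(intersect(rownames(iris.train), rownames(iris.test)))  # 0
```

Note that this kind of split is purely random, so the proportions of each Species in the two sets will generally differ a little from those in the full dataset.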
The sample.split function of the caTools
package also comes in handy here. It returns a logical vector that
splits the input into two subsets, by default two thirds in one (TRUE)
and one third in the other (FALSE), while maintaining the relative
ratio of the different categoric values represented in the vector:
> library(caTools)
> mask <- sample.split(iris$Species)
> mask
  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[...]
[145]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
> table(iris$Species[mask])
    setosa versicolor  virginica
        33         33         33
> table(iris$Species[!mask])
    setosa versicolor  virginica
        17         17         17
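To make the idea of a class-preserving split concrete, here is a rough base R sketch of the same kind of mask, applied to the data frame to obtain the two subsets. The helper stratified.mask is hypothetical, written only for illustration; it is not how caTools implements sample.split, and the seed value is arbitrary:

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical helper (not part of caTools): returns a logical mask that
# marks roughly `ratio` of the entries of each class in `y` as TRUE.
stratified.mask <- function(y, ratio = 2/3) {
  mask <- logical(length(y))
  # split() groups the positions of y by class; sample within each group.
  for (idx in split(seq_along(y), y)) {
    k <- round(length(idx) * ratio)
    mask[idx[sample.int(length(idx), k)]] <- TRUE
  }
  mask
}

mask <- stratified.mask(iris$Species)

# A logical mask indexes the rows of the data frame directly.
iris.train <- iris[mask, ]
iris.test  <- iris[!mask, ]

table(iris.train$Species)   # 33 of each species
table(iris.test$Species)    # 17 of each species
```

A mask returned by sample.split can be used in the same way, with iris[mask, ] as the training set and iris[!mask, ] as the test set.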