DATA MINING Desktop Survival Guide by Graham Williams

Training and Test Datasets
Often in modelling we build our model on a training set and then test
its performance on a test set. The simplest way to partition a dataset
into training and test sets is with the sample function, here placing
a random 80% of the rows of iris in the training set and the remaining
20% in the test set:
| 
> sub <- sample(nrow(iris), floor(nrow(iris) * 0.8))
> iris.train <- iris[sub, ]
> iris.test <- iris[-sub, ]
 |
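Because sample draws rows at random, each run produces a different partition. The following is a minimal sketch, assuming a fixed seed is acceptable for your analysis, that makes the split reproducible and checks that the two sets are disjoint and together cover the data. The seed value 42 and the name train.size are arbitrary choices made for illustration.

```r
# Fix the random seed so the same rows are selected on every run
# (the value 42 is arbitrary).
set.seed(42)

n <- nrow(iris)
train.size <- floor(n * 0.8)          # 80% of the rows go to training

sub <- sample(n, train.size)          # row indices, drawn without replacement
iris.train <- iris[sub, ]
iris.test  <- iris[-sub, ]            # every row not selected for training

# Sanity checks: the sizes add up and no row appears in both sets.
nrow(iris.train)                      # 120
nrow(iris.test)                       # 30
length(intersect(rownames(iris.train), rownames(iris.test)))  # 0
```

Note that this simple approach does not control how many rows of each species end up in each set, so the class proportions in the training and test sets can differ from those in the full data.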
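The sample.split function of the caTools
package also comes in handy here. Given a vector of class labels, it
returns a logical vector marking each element as belonging to one
subset (TRUE) or the other (FALSE). By default it places two thirds in
the first subset and one third in the second, maintaining the
relative ratio of the different categoric values represented in the
vector: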
| 
> library(caTools)
> mask <- sample.split(iris$Species)
> mask
  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[...]
[145]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
> table(iris$Species[mask])
    setosa versicolor  virginica
        33         33         33
> table(iris$Species[!mask])
    setosa versicolor  virginica
        17         17         17
 |
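The logical mask can index the data frame directly, as in iris[mask, ] and iris[!mask, ]. To illustrate what sampling within each class means, the same kind of stratified split can be sketched in base R without caTools. This is an illustration of the idea only and makes no claim about how sample.split is implemented; the names rows.by.class and train.rows, and the seed value, are hypothetical choices made for this example.

```r
# Stratified 2/3 : 1/3 split in base R: sample within each class separately.
set.seed(42)                          # arbitrary seed, for reproducibility

# Row indices grouped by species: one integer vector per factor level.
rows.by.class <- split(seq_len(nrow(iris)), iris$Species)

# From each class, draw two thirds of its rows (rounded) for training.
# Indexing with sample.int avoids the surprise that sample(x, ...) treats
# a single number x as the range 1:x.
train.rows <- unlist(lapply(rows.by.class, function(rows) {
  rows[sample.int(length(rows), round(length(rows) * 2 / 3))]
}), use.names = FALSE)

iris.train <- iris[train.rows, ]
iris.test  <- iris[-train.rows, ]

table(iris.train$Species)             # 33 rows of each species
table(iris.test$Species)              # 17 rows of each species
```

Sampling each class separately guarantees the per-class counts, whereas a single call to sample over all rows only matches the class proportions on average.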