Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Random Survival Forests

Example 1: Veterans' Administration lung cancer trial data from Kalbfleisch and Prentice, a randomized trial of two treatment regimens for lung cancer. The call uses a minimal set of arguments. Print the results, then plot the error rate and the variable importance values.



> library(randomSurvivalForest)



randomSurvivalForest 3.6.2

Type rsf.news() to see new features, changes, and bug fixes.



> data(veteran, package = "randomSurvivalForest")
> veteran.out <- rsf(Survrsf(time, status)~., data = veteran)
> print(veteran.out)



Call:
 rsf.default(formula = Survrsf(time, status) ~ ., data = veteran) 

                         Sample size: 137
                    Number of deaths: 128
                     Number of trees: 1000
          Minimum terminal node size: 3
       Average no. of terminal nodes: 21.38
No. of variables tried at each split: 2
              Total no. of variables: 6
                      Splitting rule: logrank
              Estimate of error rate: 29.53%
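
In the random survival forest literature, the error rate reported above is one minus Harrell's concordance index, computed on out-of-bag predictions. The following is a minimal base R sketch of that index on hypothetical toy data. The function name c_index and the toy vectors are illustrative only, tied survival times are simply skipped, and the package's own calculation may differ in detail.

```r
# Simplified Harrell's concordance index for right-censored data.
# time:   observed follow-up times
# status: 1 = death observed, 0 = censored
# risk:   predicted risk score (higher = expected to die sooner)
c_index <- function(time, status, risk) {
  concordant <- 0
  permissible <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is usable when subject i is known to die before time[j].
      if (status[i] == 1 && time[i] < time[j]) {
        permissible <- permissible + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1      # prediction agrees with outcome
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5    # tied predictions count half
        }
      }
    }
  }
  concordant / permissible
}

# Hypothetical toy data, not taken from the veteran data set.
toy_time   <- c(5, 10, 15, 20)
toy_status <- c(1, 1, 0, 1)
toy_risk   <- c(0.9, 0.4, 0.6, 0.1)

# Error rate in the sense used above: 1 - concordance.
error_rate <- 1 - c_index(toy_time, toy_status, toy_risk)
print(error_rate)
```

An error rate of 0.5 corresponds to random guessing and 0 to a perfect ranking of subjects by risk.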



> plot(veteran.out)



           Importance   Relative Imp
karno          0.1386         1.0000
celltype       0.0363         0.2618
diagtime       0.0045         0.0326
prior          0.0016         0.0114
trt            0.0003         0.0024
age           -0.0027        -0.0196
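
The Relative Imp column appears to be each importance value divided by the largest one. This is an inference from the printed numbers rather than something the output states, and the sketch below checks it in base R using the values copied from the table.

```r
# Importance values copied from the printed table above.
importance <- c(karno = 0.1386, celltype = 0.0363, diagtime = 0.0045,
                prior = 0.0016, trt = 0.0003, age = -0.0027)

# Assumption: relative importance = importance / largest importance.
relative <- importance / max(importance)

# Small differences from the printed column are expected because the
# inputs here are already rounded to four decimal places.
print(round(relative, 4))
```

A negative value, as for age, indicates that permuting the variable did not worsen the out-of-bag error in this run.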

[Figure: plot of error rate and variable importance for veteran.out]

Example 2: Richer argument call (veteran data). The forest is saved by setting the 'forest' option to TRUE (see 'rsf.predict' for more details about prediction). Coerce the variable 'celltype' to a factor and the Karnofsky score 'karno' to an ordered factor to illustrate factor usage in RSF. Use random splitting via 'nsplit', and record per-tree variable usage via the 'varUsed' option.
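
Before the full call, the two coercions described above can be tried on a small vector. This is a base R sketch on hypothetical toy values (the names celltype_codes and karno_scores are illustrative) showing how factor() assigns labels to sorted codes and how an ordered factor supports comparisons.

```r
# Hypothetical toy values, not taken from the veteran data set.
celltype_codes <- c(1, 3, 2, 4, 1)
karno_scores   <- c(60, 30, 90, 60, 80)

# Unordered factor: the sorted codes 1..4 are mapped to the labels in order.
celltype <- factor(celltype_codes,
                   labels = c("squamous", "smallcell", "adeno", "large"))

# Ordered factor: levels are the sorted unique scores, so "<" is meaningful.
karno <- factor(karno_scores, ordered = TRUE)

print(as.character(celltype))
print(levels(karno))
print(karno[1] < karno[3])   # compares 60 with 90 using the level order
```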



> data(veteran, package = "randomSurvivalForest")
> veteran.f <- as.formula(Survrsf(time, status)~.)
> veteran$celltype <- factor(veteran$celltype,
      labels = c("squamous", "smallcell", "adeno", "large"))
> veteran$karno <- factor(veteran$karno, ordered = TRUE)
> ntree <- 200
> mtry <- 2
> nodesize <- 3
> splitrule <- "logrank"
> nsplit <- 10
> varUsed <- "by.tree"
> forest <- TRUE
> proximity <- TRUE
> do.trace <- 1
> veteran2.out <- rsf(veteran.f, veteran, ntree,
      mtry, nodesize, splitrule, nsplit,
      varUsed = varUsed, forest = forest,
      proximity = proximity, do.trace = do.trace)
> print(veteran2.out)



Call:
 rsf.default(formula = veteran.f, data = veteran, ntree = ntree,
     mtry = mtry, nodesize = nodesize, splitrule = splitrule,
     nsplit = nsplit, forest = forest, proximity = proximity,
     varUsed = varUsed, do.trace = do.trace)

                         Sample size: 137
                    Number of deaths: 128
                     Number of trees: 200
          Minimum terminal node size: 3
       Average no. of terminal nodes: 21.385
No. of variables tried at each split: 2
              Total no. of variables: 6
                      Splitting rule: logrank *random*
       Number of random split points: 10
              Estimate of error rate: 29.67%



> plot.proximity(veteran2.out)

Take a peek at the forest ...



> head(veteran2.out$forest$nativeArray)



  treeID nodeID parmID contPT mwcpSZ
1      1      1      3      4      0
2      1      1      2     NA      1
3      1      1      0     NA      0
4      1      3      5     54      0
5      1      3      5     42      0
6      1      3      0     NA      0

Average number of times each variable was split on, per tree.



> apply(veteran2.out$varUsed, 2, mean)



     trt celltype    karno diagtime      age    prior 
   1.775    3.240    4.810    4.295    5.200    1.065
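
The call above averages each column of the 'varUsed' result. Assuming that the "by.tree" setting returns a matrix with one row per tree and one column per variable, the same computation can be sketched in base R on a hypothetical stand-in matrix (var_used and its counts are invented for illustration), where colMeans() gives the same answer as apply(..., 2, mean).

```r
# Hypothetical 3-tree by 3-variable count matrix standing in for
# veteran2.out$varUsed (rows = trees, columns = variables).
var_used <- matrix(c(2, 4, 1,
                     1, 5, 0,
                     3, 3, 2),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(NULL, c("trt", "karno", "age")))

# Column-wise mean: average number of splits on each variable per tree.
avg_apply    <- apply(var_used, 2, mean)
avg_colmeans <- colMeans(var_used)

print(avg_apply)
```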

Partial plot of top variable.



> plot.variable(veteran2.out, partial = TRUE, npred = 1)

[Figure: partial plot of the top variable for veteran2.out]



Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010