Data Mining Survivor: Tutorial_Example

DATA MINING
Desktop Survival Guide
by Graham Williams

Rattle

The output from building a decision tree contains a good deal of information. We work our way through it here, piece by piece.

Summary of the Decision Tree model for Classification (built using 'rpart'):

n= 256

Next, the structure of the tree is presented. A legend is provided first to explain how to read the tree.

node), split, n, loss, yval, (yprob)
      * denotes terminal node

This tells us that a node number will be provided, followed by a split or test (var op value), the number of entities at that node, how many entities are incorrectly classified (the loss), the default classification for the node (yval), and then the distribution of classes in that node (yprob). The distribution is ordered by class, and the order is the same for all nodes. We are told that a "*" denotes a terminal node of the tree (i.e., the tree is not split any further at that node).
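The legend can be made concrete by splitting a printed node line into its fields. The following is a minimal sketch in Python, not part of rpart or Rattle: the `Node` type, the `NODE_RE` pattern, and the `parse_node` helper are hypothetical names introduced for illustration, and the pattern assumes node lines laid out exactly as in this output.

```python
import re
from typing import NamedTuple, Tuple


class Node(NamedTuple):
    """One printed node line: node), split, n, loss, yval, (yprob), optional '*'."""
    number: int
    split: str
    n: int
    loss: int
    yval: str
    yprob: Tuple[float, ...]
    terminal: bool


# Hypothetical pattern that follows the legend field by field.
NODE_RE = re.compile(
    r"^\s*(\d+)\)\s+(.+?)\s+(\d+)\s+(\d+)\s+(\S+)\s+\(([^)]*)\)\s*(\*?)\s*$"
)


def parse_node(line: str) -> Node:
    """Split one node line into the fields named in the legend."""
    m = NODE_RE.match(line)
    if m is None:
        raise ValueError(f"not a node line: {line!r}")
    number, split, n, loss, yval, yprob, star = m.groups()
    return Node(
        number=int(number),
        split=split,
        n=int(n),
        loss=int(loss),
        yval=yval,
        yprob=tuple(float(p) for p in yprob.split()),
        terminal=(star == "*"),
    )


root = parse_node("1) root 256 41 No (0.83984375 0.16015625)")
leaf = parse_node("13) Deductions>=1679.667 8 0 1 (0.00000000 1.00000000) *")
print(root)
print(leaf)
```

The split text is matched lazily so that tests containing spaces, such as "Deductions< 1679.667", stay in one field while the three trailing fields are still read as n, loss, and yval.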

The first node of any tree is always the root node. We now work our way into the tree itself.

1) root 256 41 No (0.83984375 0.16015625)

The root node represents all 256 observations. Stopping at the root node when building a model gives a model that simply classifies every entity as whatever class is the majority in the training dataset. Skipping the 256 for a moment, we see that the report tells us that the majority class for the root node (the yval) is No. The 41 then tells us how many of the 256 will be incorrectly classified as No, technically called the loss.
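The relationship between n, the loss, and the class distribution at the root can be checked with a little arithmetic. This is a small sketch in Python using only the numbers printed on the root line; the variable names are chosen here for illustration.

```python
# Values read from the root line: 1) root 256 41 No (0.83984375 0.16015625)
n, loss = 256, 41
yprob = (0.83984375, 0.16015625)

# Always predicting the majority class misclassifies `loss` of the `n` entities.
error_rate = loss / n
accuracy = 1 - error_rate

print(f"baseline error {error_rate:.4f}, accuracy {accuracy:.4f}")
# The error rate matches the minority share in yprob and the accuracy the majority share.
print(error_rate == yprob[1], accuracy == yprob[0])
```

Any tree grown beyond the root is only worthwhile if it improves on this baseline.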

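For the tree examined in the rest of this section, which was built from 1400 entities with classes 0 and 1, the default class at the root is 0: 76.07% of the entities have a 0 classification and 23.93% have 1.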

The next sub-node splits the root entities into one of two groups, identifying those with specific values for the Marital variable (the full list is replaced with [...] here). There are 764 entities in this group, of which 47 will be incorrect when we take the default class as 0. The class distribution is 93.85% 0 and 6.15% 1. The '*' indicates that this node is not split any further, that is, it is a terminal node.

2) Marital=Absent, [...] 764 47 0 (0.93848168 0.06151832) *

The other side of the Marital split is then split further. We can see that node 13, for example, results from a split on the variable Deductions with a test of Deductions>=1679.667. There are only 8 entities in this node, with none incorrectly classified, the classification being 1, and the class distribution being 0% 0 and 100% 1. This is also a terminal node.

3) Marital=Married 636 288 0 (0.54716981 0.45283019)
  6) Occupation=Cleaner, [...] 282 65 0 (0.76950355 0.23049645)
    12) Deductions< 1679.667 274 57 0 (0.79197080 0.20802920) *
    13) Deductions>=1679.667 8 0 1 (0.00000000 1.00000000) *

The rest of the tree is:

  7) Occupation=Clerical,[...] 354 131 1 (0.37005650 0.62994350)
    14) Education=Associate,[...] 165 81 1 (0.49090909 0.50909091)
      28) Age< 33.5 36 9 0 (0.75000000 0.25000000) *
      29) Age>=33.5 129 54 1 (0.41860465 0.58139535)
        58) Age>=62 14 3 0 (0.78571429 0.21428571) *
        59) Age< 62 115 43 1 (0.37391304 0.62608696) *
    15) Education=Bachelor,[...] 189 50 1 (0.26455026 0.73544974) *
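The node numbers encode the shape of the tree: in rpart's numbering the children of node k are nodes 2k and 2k+1, which is why node 29 leads to nodes 58 and 59. The sketch below, in Python with helper names chosen here for illustration, copies n and loss from the node lines listed in this section and checks that the counts are consistent. The root entry (1400 entities, loss 335) is not printed as a node line for this tree; it is taken from the root node error line reported later in the output.

```python
# (n, loss, is_terminal) keyed by node number, copied from the printed node lines.
# Node 1 is taken from the "Root node error: 335/1400" line further on.
nodes = {
    1: (1400, 335, False),
    2: (764, 47, True),
    3: (636, 288, False),
    6: (282, 65, False),
    7: (354, 131, False),
    12: (274, 57, True),
    13: (8, 0, True),
    14: (165, 81, False),
    15: (189, 50, True),
    28: (36, 9, True),
    29: (129, 54, False),
    58: (14, 3, True),
    59: (115, 43, True),
}


def children(k: int) -> tuple:
    """Node numbers of the two children of node k."""
    return (2 * k, 2 * k + 1)


def depth(k: int) -> int:
    """Depth of node k, with the root (node 1) at depth 0."""
    return k.bit_length() - 1


# Every split partitions its entities between its two children.
for k, (n, _, terminal) in nodes.items():
    if not terminal:
        left, right = children(k)
        assert nodes[left][0] + nodes[right][0] == n

# Entities misclassified by the full tree: the sum of the losses at the terminal nodes.
leaf_loss = sum(loss for _, loss, terminal in nodes.values() if terminal)
n_splits = sum(1 for _, _, terminal in nodes.values() if not terminal)

print(f"{n_splits} splits, {leaf_loss} of {nodes[1][0]} misclassified")
print(f"error relative to the root: {leaf_loss / nodes[1][1]:.5f}")
```

The relative error computed this way can be compared with the rel error reported for the largest tree in the complexity table.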

Next, the call to the rpart function is listed:

Classification tree:
rpart(formula = Adjusted ~ ., data = crs$dataset[crs$sample, c(2:10, 13)], method = "class")

Variables actually used in tree construction:
[1] Age Deductions Education Marital Occupation

Root node error: 335/1400 = 0.23929

n= 1400

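The complexity table is useful for deciding how large a tree to keep. For each value of the complexity parameter (CP) it lists the number of splits (nsplit), the error on the training data relative to the root node error (rel error), and the cross-validated relative error (xerror) together with an estimate of its standard error (xstd):

        CP nsplit rel error  xerror     xstd
1 0.137313      0   1.00000 1.00000 0.047653
2 0.026866      2   0.72537 0.75821 0.043043
3 0.023881      4   0.67164 0.79403 0.043817
4 0.010000      6   0.62388 0.77910 0.043498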
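Because rel error and xerror are scaled so that the root node has an error of 1, multiplying them by the root node error (335/1400) turns them back into plain error rates. The table can also guide the choice of tree size. The sketch below, in Python, picks the row with the lowest xerror and then applies the one-standard-error heuristic, which is a common rule of thumb and not something this output prescribes. The variable names are chosen here for illustration.

```python
root_node_error = 335 / 1400

# Rows of the complexity table: (CP, nsplit, rel error, xerror, xstd),
# ordered from the smallest tree to the largest.
cp_table = [
    (0.137313, 0, 1.00000, 1.00000, 0.047653),
    (0.026866, 2, 0.72537, 0.75821, 0.043043),
    (0.023881, 4, 0.67164, 0.79403, 0.043817),
    (0.010000, 6, 0.62388, 0.77910, 0.043498),
]

# Convert the relative errors into error rates over the 1400 entities.
for cp, nsplit, rel, xerr, xstd in cp_table:
    print(f"nsplit={nsplit}: training error {rel * root_node_error:.4f}, "
          f"cross-validated error {xerr * root_node_error:.4f}")

# Row with the lowest cross-validated error.
best = min(cp_table, key=lambda row: row[3])

# One-standard-error heuristic (an assumption of this sketch): the smallest tree
# whose xerror is within one xstd of the lowest xerror.
threshold = best[3] + best[4]
one_se = next(row for row in cp_table if row[3] <= threshold)

print(f"lowest xerror at nsplit={best[1]} (CP={best[0]})")
print(f"one-standard-error choice: nsplit={one_se[1]} (CP={one_se[0]})")
```

Note that xerror does not fall steadily as the tree grows, even though rel error does, which is the usual sign that the extra splits are fitting noise in the training data.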

Finally, we get to see how long it took to build the tree:

Time taken: 0.14 secs

Brought to you by Togaware. This page generated: Sunday, 22 August 2010