Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Min Bucket (minbucket)

The minbucket argument is the minimum number of observations in any terminal (leaf) node.

The two arguments minbucket and minsplit are closely related. In rpart, if only one of them is specified, the other is derived from it by default: minsplit = 3 * minbucket when only minbucket is given, and minbucket = round(minsplit / 3) when only minsplit is given.
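The defaults can be checked by inspecting the list that rpart.control returns. This is a minimal sketch, assuming the rpart package is installed; the argument values 100, 30, 50 and 40 are arbitrary and chosen only for illustration.

```r
library(rpart)

# Only minbucket given: minsplit is derived as 3 * minbucket.
rpart.control(minbucket=100)$minsplit   # 300

# Only minsplit given: minbucket is derived as round(minsplit / 3).
rpart.control(minsplit=30)$minbucket    # 10

# Both given: each value is used as supplied.
ctrl <- rpart.control(minsplit=50, minbucket=40)
c(ctrl$minsplit, ctrl$minbucket)        # 50 40
```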

Using rpart directly, we specify minbucket within an argument called control, which takes the result of a function called rpart.control. In this example we set minbucket to 100:



> library(rpart)
> audit <- read.csv(url("http://rattle.togaware.com/audit.csv"))
> audit.rpart <- rpart(TARGET_Adjusted ~ Age + Marital
                                       + Occupation
                                       + Deductions,
                       data=audit,
                       method="class",
                       control=rpart.control(minbucket=100))
> audit.rpart

Changing minbucket can result in different variables being chosen at different nodes. Compare the tree obtained with the command above (with minbucket set to 100) to the result when minbucket is set to 10. Note how node 7 was originally split using Age, but with the minimum bucket size set to 10 the node is split on Deductions. We can see why: the resulting node 15 has only 30 entities, which is below the minimum of 100 required in the first tree.



[...] 
  control=rpart.control(minbucket=100))
[...]
   7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)  
    14) Age< 36.5 151  72 0 (0.52317881 0.47682119) *
    15) Age>=36.5 365 128 1 (0.35068493 0.64931507) *


[...]
  control=rpart.control(minbucket=10))
[...]
    7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)  
     14) Deductions< 1299.833 486 207 1 (0.42592593 0.57407407)  
[...]
     15) Deductions>=1299.833 30   0 1 (0.00000000 1.00000000) *
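The same kind of comparison can be tried without downloading the audit data. The following is a minimal sketch, assuming the kyphosis data frame that ships with rpart as a stand-in for audit; the minbucket values 20 and 5 and the helper leaf.sizes are illustrative choices, not part of rpart.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the audit data here.
fit.large <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
                   method="class", control=rpart.control(minbucket=20))
fit.small <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
                   method="class", control=rpart.control(minbucket=5))

# Hypothetical helper: observation counts of the terminal nodes,
# read from the frame component of the fitted rpart object.
leaf.sizes <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

# No leaf should be smaller than the minbucket used for that tree.
leaf.sizes(fit.large)
leaf.sizes(fit.small)
```

Printing fit.large and fit.small side by side then shows whether the smaller bucket size admits splits that the larger one ruled out.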

Whilst the default is to set minbucket to one third of minsplit, there is no requirement for minbucket to be less than minsplit. A node will always have at least minbucket entities, and it will be considered for splitting if it has at least minsplit entities and, on splitting, each of its children has at least minbucket entities.
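The size rule just described can be written as a small predicate. This is only a sketch of the size conditions: can_split is a hypothetical helper, not an rpart function, and it ignores the other criteria rpart applies, such as the complexity parameter. The node counts are taken from the node 7 output shown earlier.

```r
# Hypothetical helper, not part of rpart: TRUE if a node with n entities
# may be split into children with n.left and n.right entities.
can_split <- function(n, n.left, n.right, minsplit, minbucket) {
  n >= minsplit && n.left >= minbucket && n.right >= minbucket
}

# Node 7 has 516 entities. With minbucket=100 (so minsplit=300):
can_split(516, 486, 30, minsplit=300, minbucket=100)   # FALSE: 30 < 100
can_split(516, 151, 365, minsplit=300, minbucket=100)  # TRUE: the Age split

# With minbucket=10 (so minsplit=30) the Deductions split is allowed:
can_split(516, 486, 30, minsplit=30, minbucket=10)     # TRUE
```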

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010