DATA MINING
Desktop Survival Guide by Graham Williams
Min Bucket (minbucket)
The minbucket argument is the minimum number of observations in any terminal leaf node.
The two variables minbucket and minsplit are closely related. In rpart, if only one of them is specified, the other is calculated by default so that minsplit = 3 x minbucket (equivalently, minbucket = minsplit/3).
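The default relationship between the two arguments can be checked directly by inspecting the list that rpart.control returns. The following is a minimal sketch; the variable names (defaults, from_split, from_bucket) are chosen here for illustration and do not come from the book.

```r
library(rpart)

# With nothing specified, minsplit defaults to 20 and
# minbucket is derived from it as round(minsplit/3).
defaults <- rpart.control()
defaults$minsplit     # 20
defaults$minbucket    # 7

# Only minsplit given: minbucket is derived as round(30/3).
from_split <- rpart.control(minsplit = 30)
from_split$minbucket  # 10

# Only minbucket given: minsplit is derived as 3 * 10.
from_bucket <- rpart.control(minbucket = 10)
from_bucket$minsplit  # 30
```

Inspecting the control list in this way is a quick check on which values rpart will actually use before a model is built.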
Using rpart directly we specify minbucket within an option called control, which takes the result of a function called rpart.control. In this example we set minbucket to 100:
> library(rpart)
> audit <- read.csv(url("http://rattle.togaware.com/audit.csv"))
> audit.rpart <- rpart(TARGET_Adjusted ~ Age + Marital + Occupation + Deductions,
+                      data=audit, method="class",
+                      control=rpart.control(minbucket=100))
> audit.rpart
Changing minbucket can result in different variables being chosen at different nodes. Compare the tree obtained with the command above (with minbucket set to 100) to the result when minbucket is set to 10. Note how node 7 was originally split using Age, but with the minimum bucket size set to 10 the node is split on Deductions. We can see why: the Deductions split leaves node 15 with only 30 entities, which is permitted when minbucket is 10 but not when it is 100:
[...] control=rpart.control(minbucket=100)) [...]
 7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)
  14) Age< 36.5 151 72 0 (0.52317881 0.47682119) *
  15) Age>=36.5 365 128 1 (0.35068493 0.64931507) *

[...] control=rpart.control(minbucket=10)) [...]
 7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)
  14) Deductions< 1299.833 486 207 1 (0.42592593 0.57407407) [...]
  15) Deductions>=1299.833 30 0 1 (0.00000000 1.00000000) *
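The same kind of comparison can be made without downloading the audit data. The sketch below uses the kyphosis dataset that ships with rpart as a stand-in, and a helper function, smallest_leaf, that is hypothetical and written only for this illustration. It reads the leaf sizes from the frame component of the fitted tree, where terminal nodes are marked "<leaf>".

```r
library(rpart)

# Hypothetical helper: size of the smallest terminal node in an rpart tree.
smallest_leaf <- function(fit) {
  leaves <- fit$frame[fit$frame$var == "<leaf>", ]
  min(leaves$n)
}

# Same formula and data, differing only in minbucket.
fit_big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class", control = rpart.control(minbucket = 20))
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", control = rpart.control(minbucket = 5))

# No leaf is smaller than the minbucket used to build its tree.
smallest_leaf(fit_big)
smallest_leaf(fit_small)

# Printing both trees shows which splits each setting allowed.
print(fit_big)
print(fit_small)
```

Comparing the two printed trees node by node, as done above for the audit data, shows where the smaller bucket size has admitted a split that the larger one ruled out.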
Whilst the default is to set minbucket to be one third of minsplit, there is no requirement for minbucket to be less than minsplit. A node will always have at least minbucket entities, and it will be considered for splitting if it has at least minsplit entities and, on splitting, each of its children has at least minbucket entities.
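As a small illustration of the rule just described, the sketch below sets minbucket larger than minsplit, which rpart.control accepts as given. It again assumes the kyphosis dataset bundled with rpart rather than the audit data, and the variable names are illustrative only. Because each child must hold at least 20 entities, any node that is actually split must hold at least 40, even though minsplit is only 10.

```r
library(rpart)

# minbucket deliberately larger than minsplit; both are kept as supplied.
ctrl <- rpart.control(minsplit = 10, minbucket = 20)
ctrl$minsplit   # 10
ctrl$minbucket  # 20

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)

# Sizes of terminal nodes and of nodes that were split.
is_leaf <- fit$frame$var == "<leaf>"
leaf_sizes <- fit$frame$n[is_leaf]
split_sizes <- fit$frame$n[!is_leaf]

all(leaf_sizes >= 20)   # every leaf respects minbucket
all(split_sizes >= 40)  # every split node could supply two such children
```

In this configuration minbucket is the binding constraint and minsplit has no practical effect.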
Copyright © Togaware Pty Ltd