

Search Heuristic

The decision tree structure, as described in the previous section, is the ``language'' we use to express our knowledge. A sentence (or model) in this language is a particular decision tree. For any dataset there will be a very large, possibly infinite, number of possible decision trees (sentences).

Consider the simple decision tree discussed above. Instead of the variable Pressure3pm being tested against the value 1012, it could have been tested against the value 1011, or 1013, or 1020, and so on. Each choice would, once the rest of the tree has been built, yield a different sentence in the language, capturing the knowledge in a slightly different way. There are very many possible values to choose from for just this one variable, even before we begin to consider values for the other variables that appear in the decision tree.
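To make this concrete, the following sketch enumerates the candidate split values a tree builder might consider for a single numeric variable, taking them as the midpoints between consecutive distinct observed values. It assumes the weather dataset supplied with the rattle package, and the midpoint rule is just one common convention rather than the only possibility.

## A sketch only: assumes the weather dataset from the rattle package and
## the common convention that candidate split points for a numeric variable
## are the midpoints between consecutive distinct observed values.
library(rattle)
data(weather)

x      <- sort(unique(weather$Pressure3pm))
splits <- (head(x, -1) + tail(x, -1)) / 2

length(splits)   # the number of alternative tests on this one variable
head(splits)     # the first few candidate split values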

Alternatively, we might choose to test the value of a different variable at the root node (or at any other node). Perhaps we could test the value of Humidity3pm instead of Pressure3pm. This again introduces a very large collection of possible sentences that we might generate within the constraints of the language we have defined. Each sentence is a candidate for capturing the knowledge that is consistent with the observations represented in our training dataset.

As we saw in Section 9.1, this wealth of possible sentences presents a challenge: which is the best sentence or, equivalently, which is the best model of our data? Our task is to identify the sentence (or perhaps sentences) that best captures the knowledge that can be obtained from the observations we have available to us. Yet we generally have a very large, possibly infinite, collection of possible sentences to choose from.

Enumerating every possible sentence and then testing whether each is a good model will generally be too computationally expensive; it could well involve days, weeks, months, or more of computer time. Our task is to use the observations (the training dataset) to narrow down the search so that we can find a good model in a reasonable amount of time.

The algorithm that has been developed for decision tree induction is referred to as a divide-and-conquer, or recursive partitioning, algorithm. The basic idea is to choose a single question for a node, partition the training data according to the answer, and then repeat the process on each partition.

The usual decision tree induction algorithm is what is generally called a greedy algorithm. It is greedy in that once it decides on the question to ask at a particular node of the decision tree, it does not reconsider that choice later on. The algorithm is divide-and-conquer in that it partitions the dataset into smaller subsets. In fact, the general algorithm continues to partition each subset until every subset contains a single value of the output variable.
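To illustrate, here is a minimal sketch in R of this greedy, recursive partitioning idea. It is not the implementation used by C4.5 or rpart: it handles only numeric variables with binary splits, and best.split is a hypothetical helper standing in for whatever measure is used to score the candidate questions at a node.

## A minimal sketch of greedy recursive partitioning, not the C4.5 or
## rpart implementation. 'best.split' is a hypothetical helper that scores
## every candidate (variable, value) test on the data passed to it and
## returns the best, or NULL when no worthwhile split exists.
build.tree <- function(data, target, best.split)
{
  ## The majority class at this node, used whenever we stop and form a leaf.
  majority <- names(which.max(table(data[[target]])))

  ## Stop when the node is pure (all observations share one output value).
  if (length(unique(data[[target]])) == 1)
    return(list(leaf=TRUE, class=majority))

  ## Greedy step: choose one question now and never reconsider it.
  split <- best.split(data, target)
  if (is.null(split)) return(list(leaf=TRUE, class=majority))

  ## Divide: partition the data on the chosen question.
  left  <- data[data[[split$var]] <  split$value, ]
  right <- data[data[[split$var]] >= split$value, ]
  if (nrow(left) == 0 || nrow(right) == 0) return(list(leaf=TRUE, class=majority))

  ## Conquer: recursively build a subtree for each partition.
  list(leaf=FALSE, var=split$var, value=split$value,
       left=build.tree(left,  target, best.split),
       right=build.tree(right, target, best.split))
}

For example, best.split could rank the candidate (variable, value) pairs by a measure such as information gain; the nested list returned by build.tree then mirrors the tree structure described in the previous section.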

Generally, C4.5 is quite efficient as the number of training instances increases: for specific datasets its run time has been found empirically to scale between $O(n^{1.22})$ and $O(n^{1.38})$. With rule generation the algorithm is considerably more expensive, at $O(n^4)$.

Todo: Formal specification of the algorithm in pseudocode.
