DATA MINING
Desktop Survival Guide
by Graham Williams

Rescale Data

Different model builders require different characteristics of the data from which the models will be built. For example, when building a clustering using any kind of distance measure, we may need to normalise the data. Otherwise, a variable like Income will overwhelm a variable like Age when calculating distances. A difference of 10 "years" may be more significant than a difference of $10,000, yet the latter swamps the former when the two are added together, as happens when calculating distances.
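To make the swamping effect concrete, here is a minimal Python sketch. The three observations and the choice of Euclidean distance are assumptions made purely for illustration; they do not come from any dataset used in the book.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical observations: (Age in years, Income in dollars).
alice = (30, 50_000)
bob = (40, 60_000)    # 10 years older than alice and $10,000 more income
carol = (30, 60_000)  # same age as alice and $10,000 more income

d_alice_bob = euclidean(alice, bob)
d_alice_carol = euclidean(alice, carol)

# The 10-year age gap barely changes the distance: Income dominates.
print(d_alice_bob, d_alice_carol)
```

On the raw values the two distances are almost identical, so a clustering based on this distance would effectively ignore Age.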

In these situations we will want to Normalise our data. The types of normalisation (available through the Normalise option of the Transform tab) we may want to perform include re-centering and rescaling our data to be around zero (Recenter), rescaling our data to be in the range from 0 to 1 (Scale [0,1]), converting the numbers into a rank ordering (Rank), and finally, a robust rescaling around zero using the median (-Median/MAD). Figure 23.2 displays the interface.
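The four normalisations can be sketched in plain Python as follows. The definitions here are common textbook ones and are assumptions: the tool's own implementation may differ in detail, for example in how it handles tied ranks or whether it multiplies the MAD by a consistency constant. The function names and the income values are hypothetical.

```python
import statistics

def recenter(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

def scale01(xs):
    """Map the minimum to 0 and the maximum to 1 (assumes max > min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def rank(xs):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def median_mad(xs):
    """Subtract the median and divide by the raw median absolute
    deviation (no consistency constant; assumes the MAD is non-zero)."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return [(x - med) / mad for x in xs]

# Hypothetical incomes with one large outlier.
income = [20_000, 35_000, 50_000, 65_000, 250_000]

print(recenter(income))
print(scale01(income))
print(rank(income))
print(median_mad(income))
```

With an outlier like the last value, the [0,1] scaling squeezes the other four values into the bottom of the range, while the median-based rescaling keeps them evenly spread. This is the sense in which it is the more robust choice.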

**Figure 23.2:** Selection of normalisations performed on Income.

We can see in Figure 23.2 the approach we take to normalising (and to transforming) our data. The original data is not modified. Instead, a new variable is created with a prefix added to the variable's name that indicates the kind of transformation. As we can see in the figure, the prefixes are NORM_RECENTER_, NORM_SCALE01_, NORM_RANK_, and NORM_MEDIANAD_.
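The same non-destructive convention is easy to mimic in code. The sketch below assumes a dataset held as a dictionary of columns. The helper `add_scale01` and the sample values are hypothetical; only the `NORM_SCALE01_` prefix is taken from the text above.

```python
def add_scale01(dataset, name):
    """Add a [0,1]-scaled copy of column `name` under a prefixed name,
    leaving the original column untouched (assumes max > min)."""
    values = dataset[name]
    lo, hi = min(values), max(values)
    new_name = "NORM_SCALE01_" + name
    dataset[new_name] = [(v - lo) / (hi - lo) for v in values]
    return new_name

# Hypothetical dataset with a single variable.
data = {"Age": [20, 30, 60]}
created = add_scale01(data, "Age")

print(created)        # name of the newly created variable
print(data["Age"])    # original values are still present
print(data[created])  # rescaled values
```

Keeping the original variable alongside the transformed one means a transformation can be dropped or redone without reloading the data.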

We can see the effect of the four normalisations by comparing the histogram of the variable Age in Figure 23.3 with the four plots in Figure 23.4 for the corresponding four normalisations.

**Figure 23.3:** Original distribution of Age.

**Figure 23.4:** Normalisations of Age.

Brought to you by Togaware. This page generated: Sunday, 22 August 2010