DATA MINING
Desktop Survival Guide by Graham Williams
Normalising Data
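R's scale function re-centers and re-scales the columns of a numeric matrix. Re-centering subtracts each column's mean from every value in that column. Re-scaling then divides each centered value by the column's standard deviation. (If centering is turned off, scale divides by the column's root-mean-square instead.) The values used are returned as the scaled:center and scaled:scale attributes of the result.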
> ds <- wine[1:20,c(2,9,14)]
> summary(ds)
    Alcohol      Nonflavanoids       Proline
 Min.   :13.16   Min.   :0.1700   Min.   : 735
 1st Qu.:13.72   1st Qu.:0.2600   1st Qu.:1061
 Median :14.11   Median :0.2950   Median :1280
 Mean   :14.01   Mean   :0.2970   Mean   :1235
 3rd Qu.:14.32   3rd Qu.:0.3225   3rd Qu.:1352
 Max.   :14.83   Max.   :0.4300   Max.   :1680
> ds
   Alcohol Nonflavanoids Proline
1    14.23          0.28    1065
2    13.20          0.26    1050
3    13.16          0.30    1185
4    14.37          0.24    1480
5    13.24          0.39     735
6    14.20          0.34    1450
7    14.39          0.30    1290
8    14.06          0.31    1295
9    14.83          0.29    1045
10   13.86          0.22    1045
11   14.10          0.22    1510
12   14.12          0.26    1280
13   13.75          0.29    1320
14   14.75          0.43    1150
15   14.38          0.29    1547
16   13.63          0.30    1310
17   14.30          0.33    1280
18   13.83          0.40    1130
19   14.19          0.32    1680
20   13.64          0.17     845
> scale(ds)
      Alcohol Nonflavanoids    Proline
1   0.4630901   -0.27054355 -0.7184008
2  -1.7198976   -0.58883009 -0.7819386
3  -1.8046738    0.04774298 -0.2100983
4   0.7598069   -0.90711662  1.0394785
5  -1.6351214    1.48003239 -2.1162325
6   0.3995079    0.68431605  0.9124029
7   0.8021950    0.04774298  0.2346663
8   0.1027912    0.20688625  0.2558456
9   1.7347334   -0.11140029 -0.8031179
10 -0.3210899   -1.22540316 -0.8031179
11  0.1875674   -1.22540316  1.1665541
12  0.2299555   -0.58883009  0.1923078
13 -0.5542245   -0.11140029  0.3617419
14  1.5651810    2.11660546 -0.3583532
15  0.7810009   -0.11140029  1.3232807
16 -0.8085532    0.04774298  0.3193834
17  0.6114485    0.52517278  0.1923078
18 -0.3846721    1.63917565 -0.4430703
19  0.3783139    0.36602952  1.8866493
20 -0.7873591   -2.02111950 -1.6502886
attr(,"scaled:center")
      Alcohol Nonflavanoids       Proline
      14.0115        0.2970     1234.6000
attr(,"scaled:scale")
      Alcohol Nonflavanoids       Proline
   0.47183042    0.06283646  236.07991510
> ds
   Alcohol Nonflavanoids Proline
1    14.23          0.28    1065
2    13.20          0.26    1050
3    13.16          0.30    1185
4    14.37          0.24    1480
5    13.24          0.39     735
6    14.20          0.34    1450
7    14.39          0.30    1290
8    14.06          0.31    1295
9    14.83          0.29    1045
10   13.86          0.22    1045
11   14.10          0.22    1510
12   14.12          0.26    1280
13   13.75          0.29    1320
14   14.75          0.43    1150
15   14.38          0.29    1547
16   13.63          0.30    1310
17   14.30          0.33    1280
18   13.83          0.40    1130
19   14.19          0.32    1680
20   13.64          0.17     845
> summary(scale(ds))
    Alcohol             Nonflavanoids          Proline
 Min.   :-1.805e+00   Min.   :-2.021e+00   Min.   :-2.116e+00
 1st Qu.:-6.125e-01   1st Qu.:-5.888e-01   1st Qu.:-7.343e-01
 Median : 2.088e-01   Median :-3.183e-02   Median : 1.923e-01
 Mean   :-3.381e-15   Mean   :-6.217e-16   Mean   : 3.886e-16
 3rd Qu.: 6.485e-01   3rd Qu.: 4.058e-01   3rd Qu.: 4.994e-01
 Max.   : 1.735e+00   Max.   : 2.117e+00   Max.   : 1.887e+00
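To make the arithmetic behind scale concrete, the following is a minimal sketch that reproduces it by hand on the first five rows of ds from the listing above. The names ds5, centers, sds, manual and restored are introduced here for illustration only. It assumes the default arguments of scale (center = TRUE, scale = TRUE), in which case the divisor is each column's standard deviation, matching the scaled:scale attribute shown in the session.

```r
# First five rows of ds, copied from the listing above (illustrative subset).
ds5 <- data.frame(
  Alcohol       = c(14.23, 13.20, 13.16, 14.37, 13.24),
  Nonflavanoids = c(0.28, 0.26, 0.30, 0.24, 0.39),
  Proline       = c(1065, 1050, 1185, 1480, 735)
)

# Re-center and re-scale by hand: subtract each column's mean,
# then divide by each column's standard deviation.
centers <- colMeans(ds5)
sds     <- apply(ds5, 2, sd)
manual  <- sweep(sweep(as.matrix(ds5), 2, centers, "-"), 2, sds, "/")

# scale() gives the same values and records the center and scale it used
# in the "scaled:center" and "scaled:scale" attributes.
scaled <- scale(ds5)
all.equal(manual, scaled, check.attributes = FALSE)

# The stored attributes are enough to undo the transformation.
restored <- sweep(sweep(scaled, 2, attr(scaled, "scaled:scale"), "*"),
                  2, attr(scaled, "scaled:center"), "+")
all.equal(restored, as.matrix(ds5), check.attributes = FALSE)
```

Because only five rows are used here, the means and standard deviations differ from those in the 20-row session, so the scaled values will not match the listing digit for digit.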
The function rescaler from Hadley Wickham's reshape package supports five methods for rescaling/standardising data: rescale to the range [0, 1]; subtract the mean and divide by the standard deviation; subtract the median and divide by the median absolute deviation; convert values to ranks; and do nothing.
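The five methods can be sketched in base R as follows. This is not the source of rescaler: rescale_sketch is a hypothetical helper written for illustration, its type names are chosen here, and the range method is assumed to map values onto [0, 1]. Note that base R's mad multiplies the raw median absolute deviation by 1.4826 by default; whether rescaler uses the same constant is an assumption.

```r
# Hypothetical helper (not part of reshape): one branch per method listed above.
rescale_sketch <- function(x, type = c("sd", "range", "robust", "rank", "I")) {
  type <- match.arg(type)
  switch(type,
    range  = (x - min(x)) / (max(x) - min(x)),  # rescale to [0, 1] (assumed range)
    sd     = (x - mean(x)) / sd(x),             # subtract mean, divide by std. deviation
    robust = (x - median(x)) / mad(x),          # subtract median, divide by MAD
    rank   = rank(x),                           # convert values to ranks
    I      = x                                  # do nothing
  )
}

# First five Alcohol values from the wine subset used earlier.
alcohol <- c(14.23, 13.20, 13.16, 14.37, 13.24)

rescale_sketch(alcohol, "range")
rescale_sketch(alcohol, "sd")
rescale_sketch(alcohol, "robust")
rescale_sketch(alcohol, "rank")
rescale_sketch(alcohol, "I")
```

The robust variant is less sensitive to outliers than the mean and standard deviation version, which is the usual reason for choosing it.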