Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Review Data

Often we will find ourselves loading data from a CSV file, a format that R readily supports (see Section 30.3.4). On first loading the data we generally want a quick overview, using R's summary function. It is here that we might notice that some columns we expected to be numeric have been read as factors!

Consider the example of the cardiac dataset (see Section 30.3.4).



> cardiac <- read.csv("cardiac.data", header=FALSE)
> summary(cardiac)
[...]
      V10               V11           V12           V13           V14     
 Min.   :-172.00   52     : 13   60     : 23   49     :  9   ?      :376  
 1st Qu.:   3.75   36     : 10   ?      : 22   55     :  9   84     :  3  
 Median :  40.00   42     :  9   61     : 16   59     :  9   -157   :  2  
 Mean   :  33.68   10     :  8   56     : 14   62     :  9   -164   :  2  
 3rd Qu.:  66.00   33     :  8   58     : 13   26     :  8   -93    :  2  
 Max.   : 169.00   41     :  8   68     : 12   33     :  8   103    :  2  
                   (Other):396   (Other):352   (Other):400   (Other): 65  
[...]

Our understanding of the data might be that we expect these variables to be numeric. The telltale sign is V14 having a ? as one of its values. A little more exploration, showing the frequency of each value, will indicate that the apparently nominal variables have only a single non-numeric value: the ?. When we read the data from the CSV file we need to tell R that ? is used to indicate missing values.
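The frequency check described above might be sketched as follows. This is only an illustration: the vector v14 is a small made-up stand-in for a column such as V14, not the actual cardiac data, and the variable names are assumptions.

```r
# Hypothetical stand-in for a column that was read as text because of "?".
# If the column is a factor, convert it first with as.character().
v14 <- c("84", "?", "-157", "?", "103", "?", "-93")

# Frequency of each distinct value, most common first.
freq <- sort(table(v14), decreasing = TRUE)
print(freq)

# Identify the values that cannot be interpreted as numbers.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
is_bad <- is.na(suppressWarnings(as.numeric(v14)))
non_numeric <- unique(v14[is_bad])
print(non_numeric)
```

If the only non-numeric value reported is the ?, then treating it as a missing value marker on input is a reasonable next step.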



> cardiac <- read.csv("cardiac.data", header=FALSE, na.strings="?")
> summary(cardiac)
[...]
      V11               V12               V13               V14       
 Min.   :-177.00   Min.   :-170.00   Min.   :-135.00   Min.   :-179.00
 1st Qu.:  14.00   1st Qu.:  41.00   1st Qu.:  12.00   1st Qu.:-124.50
 Median :  41.00   Median :  56.00   Median :  40.00   Median : -50.50
 Mean   :  36.15   Mean   :  48.91   Mean   :  36.72   Mean   : -13.59
 3rd Qu.:  63.25   3rd Qu.:  65.00   3rd Qu.:  62.00   3rd Qu.: 117.25
 Max.   : 179.00   Max.   : 176.00   Max.   : 166.00   Max.   : 178.00
 NA's   :   8.00   NA's   :  22.00   NA's   :   1.00   NA's   : 376.00
[...]

That's looking better. Note that the NAs are reported and that V14 has 376 of them, in accord with the previous observation of 376 occurrences of ?.
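The same agreement between the count of ? values and the count of NAs can be checked directly rather than read off the summaries. The sketch below assumes a tiny made-up sample in place of cardiac.data, passed to read.csv through its text argument so that the example is self-contained. The sample values and variable names are hypothetical.

```r
# Hypothetical three-column sample standing in for cardiac.data.
csv_lines <- c("75,0,?",
               "56,1,84",
               "54,0,?",
               "55,0,-157")

# Without na.strings the third column is not read as numeric.
raw <- read.csv(text = csv_lines, header = FALSE)

# With na.strings the "?" entries become NA and the column is numeric.
clean <- read.csv(text = csv_lines, header = FALSE, na.strings = "?")

# The number of "?" entries should equal the number of NAs.
n_question <- sum(raw$V3 == "?")
n_missing <- sum(is.na(clean$V3))
print(c(question_marks = n_question, missing = n_missing))

# Per-column NA counts for the whole data frame.
print(colSums(is.na(clean)))
```

Comparing the two counts for every affected column is a quick way to confirm that nothing other than the ? was turned into a missing value.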

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010