DATA MINING
Desktop Survival Guide by Graham Williams
The CSV option of the Data tab is an easy way to load data from many different sources into Rattle. CSV stands for "comma separated value" and is a standard file format often used to exchange data between applications. CSV files can be exported from spreadsheets and databases, including OpenOffice Calc, Gnumeric, MS/Excel, SAS/Enterprise Miner, Teradata's Warehouse, and many other applications. This makes CSV a good option for importing data into Rattle, although the format does lose metadata (that is, information about the data types of the variables in the dataset). Without this metadata R sometimes guesses the wrong data type for a particular column, but this is not usually fatal.
An example CSV file is provided by Rattle and is called
audit.csv. It will have been installed when we installed
Rattle, and we can find its actual location with:
Listing: Locate and view the package supplied sample dataset

> system.file("csv", "audit.csv", package = "rattle")
[1] "/usr/local/lib/R/site-library/rattle/csv/audit.csv"
> file.show(system.file("csv", "audit.csv", package = "rattle"))
The top of the file will be similar to the following (perhaps with
quotes around values, although they are not necessary, and perhaps
with some different values):
Listing: A sample of the top 6 lines of the CSV file audit.csv

ID,Age,Employment,Education,Marital,Occupation,Income,Gender,...
1004641,38,Private,College,Unmarried,Service,81838,Female,...
1010229,35,Private,Associate,Absent,Transport,72099,Male,...
1024587,32,Private,HSgrad,Divorced,Clerical,154676.74,Male,...
1038288,45,Private,Bachelor,Married,Repair,27743.82,Male,...
1044221,60,Private,College,Married,Executive,7568.23,Male,...
...
A CSV file is actually a normal text file that you could load into any text editor to review its contents. A CSV file usually begins with a header row, listing the names of the variables, each separated by a comma. If any name (or indeed, any value in the file) contains an embedded comma, then that name (or value) will be surrounded by quote marks. The remainder of the file after the header is expected to consist of rows of data. Each row records information about one entity, with its fields, generally separated by commas, giving the values of the variables for that entity.
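The header row and the quoting rule can be illustrated with a minimal sketch in plain R, outside of Rattle. The tiny CSV below is made up for illustration: the IDs come from the sample listing, but the occupation value "Transport, Freight" is hypothetical and does not occur in audit.csv.

```r
# A tiny CSV held in a string: one header row, then two data rows.
# The second data row has an Occupation containing a comma, so the
# whole value is wrapped in quote marks (hypothetical value).
csv_text <- 'ID,Occupation,Income
1004641,Service,81838
1010229,"Transport, Freight",72099'

# read.csv() treats the first row as the header by default and
# keeps the quoted field together as a single value.
quoted <- read.csv(text = csv_text)

print(names(quoted))
print(as.character(quoted$Occupation))
```

Without the quote marks, the embedded comma would be read as a field separator and the second row would appear to have four fields rather than three.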
To make a CSV file known to Rattle we click the Filename button. A file chooser dialog will pop up (Figure 3.2). We can use this to browse our file system to find the file we wish to load into Rattle. By default, only files that have a .csv extension will be listed (together with folders). The dialog includes a pull-down menu near the bottom right, above the Open button, to select which files are listed. We can list only files that end with .csv or .txt, or else list all files. The .txt files are similar to CSV files but tend to use a tab, rather than a comma, to separate columns in the data. The panel on the left of the dialog allows us to browse to the different file systems available to us, while the series of boxes at the top lets us navigate through a series of folders on a single file system. Once we have navigated to the folder in which we have saved the audit.csv file, we can select this file in the main panel of the file chooser dialog. We then click the Open button to tell Rattle that this is the file we are interested in.
Notice in Figure 3.3 that the textview of the Data tab has changed to give a reminder as to what we need to do next. That is, we have not yet told Rattle to actually load the data; we have just identified where the data is. So we now click the Execute button (or press the F5 key) to load the dataset from the audit.csv file. Since Rattle is a simple graphical interface sitting on top of R itself, the message in the textview also reminds us that some errors encountered by R on loading the data (and in fact during any operation performed by Rattle) may be displayed in the R Console.
You can choose the field delimiter through the Separator entry. A
comma is the default. To load a .txt file which uses a tab as the field separator, enter \\t
(that is, two backslashes followed by a t) as the separator. You
can also leave the separator empty, and any white space will be used as
the separator.
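For comparison, the same two choices can be sketched directly in R. This is only an illustration of the underlying idea, assuming the standard read.csv() and read.table() functions; how Rattle passes the Separator entry on to R is not shown here. Note that inside an R string a tab is written with a single backslash, as "\t". The small datasets are made up from the first rows of the sample listing.

```r
# Tab separated data: "\t" inside an R string is a single tab character.
tab_text <- "ID\tAge\tGender\n1004641\t38\tFemale\n1010229\t35\tMale\n"
tabbed <- read.csv(text = tab_text, sep = "\t")

# White space separated data: an empty sep in read.table() means that
# any run of spaces or tabs separates the fields.
space_text <- "ID Age   Gender\n1004641 38 Female\n1010229   35 Male\n"
spaced <- read.table(text = space_text, header = TRUE, sep = "")

print(tabbed)
print(spaced)
```

Both calls produce a table with two rows and three columns, even though the second input uses an irregular number of spaces between fields.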
Any field with no value (i.e., nothing between a pair of commas), or having the value "NA" or "." or "?", is treated as a missing value, which is represented in R as the special value NA (a marker for missing data, not a character string). Support for the "." convention allows the importation of CSV data generated by SAS, whilst the use of "?" is common following its use in some of the early machine learning applications like C4.5.
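One way to reproduce these conventions in plain R is the na.strings argument of read.csv(). The sketch below is an assumption about how this could be done by hand, not a description of Rattle's own code, and the data values are made up for illustration.

```r
# Each of rows 2 to 4 marks a missing Age in a different way:
# an empty field, the text NA, and a question mark. Row 3 also
# uses the SAS style "." for a missing Income.
csv_text <- "ID,Age,Income
1,38,81838
2,,72099
3,NA,.
4,?,27743.82"

# List every marker that should be read as a missing value.
with_na <- read.csv(text = csv_text, na.strings = c("", "NA", ".", "?"))

print(with_na)
print(is.na(with_na))
```

Because the markers are converted to NA before R guesses the column types, Age and Income are still read as numbers rather than as text.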
The contents of the textview of the Data tab have now changed again, as we see in Figure 3.4. The panel contains a brief summary of the dataset. From the summary we see that Rattle has loaded the file we requested, showing the full path to the file. We then see that Rattle has created something called a data.frame. This is a basic data type in R used to store a table of data, where the columns (the variables) can have a mixture of data types. We then see that Rattle has loaded 2,000 entities (called observations or obs. in R), each described by 13 variables. The data type, and the first few values, of each variable are also displayed.
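A similar summary can be produced by hand with R's str() function. The sketch below is an assumption about how to get an equivalent view, not Rattle's own code. To keep it self-contained it uses only three rows and five columns copied from the sample listing rather than the full audit.csv. We set stringsAsFactors = TRUE explicitly because, since R 4.0.0, read.csv() no longer converts text columns to factors by default.

```r
# Three rows and five of the columns from the sample listing.
csv_text <- "ID,Age,Employment,Education,Income
1004641,38,Private,College,81838
1010229,35,Private,Associate,72099
1024587,32,Private,HSgrad,154676.74"

# Text columns become factors, as in the summary described above.
audit_head <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(audit_head))  # the kind of object created
print(dim(audit_head))    # number of observations, number of variables
str(audit_head)           # data type and first few values of each variable
```

With only three rows the factors have far fewer levels than in the full dataset, so the output will differ in detail from Figure 3.4.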
We can start getting an idea of the shape of the data from this simple summary. For example, the first two variables, ID and Age, are both identified as integers (int). The first few values of ID are 1004641, 1010229, 1024587, and so on. They all appear to be of the same length (i.e., the same number of digits), which, together with a name like ID, is a very strong indicator that this is some kind of identifier for each entity. The first few values of Age are 38, 35, 32, 45, 60, and so on.
The next variable, Employment, illustrates how R deals with categorical variables. In R terms it is a Factor with 8 levels (i.e., 8 possible values). The levels begin with "Consultant" and "Private". The following sequence of numbers, all of which happen to be 2 for the first 10 entities of this dataset, discloses how R stores categorical data. Effectively, R maintains an integer indexed table, associating the levels with integers, so that "Consultant" is associated with 1, "Private" with 2, and so on. Then only these integers need to be stored for each entity, which is generally more efficient in memory usage. We see this more convincingly for the following categorical variables, Education, Marital, and Occupation (because they have more than just a single level displayed in this summary).
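The integer coding of a factor can be inspected directly. This is a minimal sketch using a made-up vector with only two of the eight Employment levels, so the level table is shorter than the one Rattle reports.

```r
# A small categorical variable with two distinct values.
employment <- factor(c("Private", "Private", "Consultant", "Private"))

# The table of levels, sorted alphabetically by default.
print(levels(employment))

# The integers actually stored for each entity: each one is the
# position of that entity's value in the table of levels.
print(as.integer(employment))
```

Here "Consultant" sorts first and so takes the code 1, while "Private" takes the code 2, matching the association described above.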
The seventh variable, Income, has been identified as a more general numeric rather than specific integer variable. The display of the first few values does not actually give us any insight as to why this might be so, but reviewing the actual CSV data in the sample of audit.csv listed earlier, we see that the third entity actually has a value of 154676.74 for Income, indicating that these values are real numbers rather than just integers.
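The effect of a single non-integer value on the guessed type can be checked with a small sketch. The two inputs below are made up for illustration: the first uses whole numbers only, while the second includes the value 154676.74 from the sample listing.

```r
# Every value is a whole number, so R reads the column as integer.
whole <- read.csv(text = "Income\n81838\n72099\n27743")

# One value has a fractional part, so R reads the whole column
# as the more general numeric type.
mixed <- read.csv(text = "Income\n81838\n72099\n154676.74")

print(class(whole$Income))
print(class(mixed$Income))
```

R chooses one type per column, so a single fractional value anywhere in the file is enough to make the entire column numeric.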
We also note that Adjusted, for example, looks like it might be a categorical variable, with values 0 and 1, but R identifies it as an integer! That's fine for our purposes here. We can always change this later.
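Should we want to make such a change by hand, one possibility in plain R is as.factor(). This is only a sketch of the idea, not necessarily how Rattle itself handles the conversion, and the five values of adjusted below are made up for illustration.

```r
# A 0/1 variable as R would first read it: plain integers.
adjusted <- c(0L, 1L, 0L, 0L, 1L)
print(class(adjusted))

# Convert it to a categorical variable with the levels "0" and "1".
adjusted_factor <- as.factor(adjusted)
print(levels(adjusted_factor))

# Count how many entities fall into each category.
print(table(adjusted_factor))
```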
Copyright © Graham.Williams@togaware.com Support further development through the purchase of the PDF version of the book.