DATA MINING
Desktop Survival Guide by Graham Williams
R is a command line tool. We saw how to interact with R in Section 2.1. Essentially, R displays a prompt to indicate that it is waiting for us to issue a command. Two such commands are library and rattle.
Generally we instruct R to evaluate functions--a technical term used to describe a mathematical object that returns a result. All functions in R return a result and that result can be passed to other functions to do other things. This simple idea is actually a very powerful concept, allowing functions to do well what they are designed to do (like building a model), and pass on their output to other functions to do something with it (like formatting it for easy reading).
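As a small illustration of passing one function's result on to another, the sketch below uses only the base R functions c, mean and round. The numbers are made up for the example.

```r
# Made-up values, purely for illustration.
temps <- c(8.0, 14.0, 13.7)

# mean() returns a single number ...
avg <- mean(temps)

# ... which we pass on to round() to format it for easier reading.
rounded <- round(avg, 1)
print(rounded)

# The same thing written as one nested call.
print(round(mean(temps), 1))
```

The nested form is the more common style at the command line, since the intermediate result does not need a name.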
Functions might also have side effects--that is, they might do more than simply return some result. We evaluate the function rattle, for example, not to get a result from the function, but to start up the GUI and allow us to start data mining. Whilst rattle is still a function, we will usually refer to it as a command rather than a function, though the two terms can be, and often are, used interchangeably.
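To make the distinction between a result and a side effect concrete, here is a minimal sketch. The function announce is hypothetical, defined only for this illustration; cat, invisible and nchar are base R functions.

```r
# A hypothetical function, defined here only for illustration.
# Its side effect is printing a line; its result is the number of characters.
announce <- function(text) {
  cat("Note:", text, "\n")   # side effect: writes to the console
  invisible(nchar(text))     # result: returned, but not printed automatically
}

announce("starting up")       # we see only the side effect
n <- announce("starting up")  # the result is still there to be used
print(n)
```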
We saw in Section 2.1 two function calls, which we repeat below. The first was a call to the function library where we asked R to load the rattle package. We then started up Rattle with a call to the rattle function:
> library(rattle)
> rattle()
Irrespective of the purpose of the function, for each call of a function we usually supply arguments that refine the behaviour of the function. We did that above in the call to the library function where the argument was rattle. Another simple example is to call the dim (dimensions) function with the argument weather.
> dim(weather)
[1] 366  24
Here, weather is a variable name. We can think of it simply as a reference to some object (something that contains data). The object, in this case, is the weather dataset as introduced above. It is organised as rows and columns. We might also note that the name dim itself is a reference to some object--a function object.
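The idea that both a dataset name and a function name are references to objects can be checked directly. This sketch assumes the mtcars data frame that ships with R, used here so that it runs without the rattle package; it is not the weather dataset.

```r
# mtcars is a small data frame shipped with R (an assumption of this example;
# the text itself uses the weather dataset from the rattle package).
print(class(mtcars))   # the kind of object the name mtcars refers to
print(class(dim))      # dim refers to a function object
print(dim(mtcars))     # calling the function on the dataset
```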
If we type a name (e.g., either weather or dim) at the R prompt, R will respond by showing us the object. Typing weather (followed by pressing the Enter key) will display the actual data, and we will see all 366 rows scroll up the screen. If we type dim and press Enter we will see the definition of the function (which in this case is a primitive function coded into the core of R):
> dim
function (x)  .Primitive("dim")
A common mistake made by new users is to type a function name by itself (without arguments) and end up a little confused about the resulting output. To actually invoke the function we need to supply the argument list, which may be an empty list. Thus, at a minimum, we add () to the function call on the command line:
> dim()
Error in dim() : 0 arguments passed to 'dim' which requires 1
As shown above, executing this command generates an error message. The message tells us that dim requires 1 argument and that no arguments were passed to it. Some functions can be invoked with no arguments, as is the case for the rattle command (see Section 2.1).
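The two cases can be tried side by side. In this sketch tryCatch is used only so that the error message is captured rather than halting a script, and Sys.Date is chosen as an example of a base R function that needs no arguments; neither appears in the text above.

```r
# Capture the error message rather than stopping the script.
msg <- tryCatch(dim(), error = function(e) conditionMessage(e))
print(msg)

# Sys.Date() is a base R function that can be called with an empty argument list.
today <- Sys.Date()
print(today)
```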
The examples above illustrate how we will show our interaction with R. The > is R's prompt, and when we see it we know that R is waiting for commands. We type the string of characters, dim(weather), as the command--in this case a call to the dim function. We then press the Enter key to send the command off to R. R responds with the result from the function. In the case above it returned the result [1] 366 24.
The dim function returns a vector (a sequence of items) of length two. The [1] simply tells us that the first number we see from the vector (the 366) is the first element of the vector. The second element is 24.
The two numbers listed by R in the above example (i.e., the vector returned by the dim function) are the number of rows and columns, respectively, in the weather dataset--that is, its dimensions.
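Because the result is an ordinary vector, we can store it and pick out its elements by position. The sketch below again assumes the built-in mtcars data frame in place of weather.

```r
d <- dim(mtcars)   # mtcars ships with R; used here in place of weather
print(length(d))   # the vector has two elements
print(d[1])        # first element: number of rows
print(d[2])        # second element: number of columns

# nrow() and ncol() return the same two values individually.
print(nrow(mtcars) == d[1])
print(ncol(mtcars) == d[2])
```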
For very long vectors the listing of the elements of the vector will be wrapped to fit across the screen, and each line will start with a number within square brackets to indicate what element of the vector we are up to. We can illustrate this with the seq command which generates a sequence of numbers:
> seq(1, 50)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
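Where the lines wrap depends on how wide R believes the console to be. As a rough sketch, we can narrow the width temporarily with the width option and print the same sequence again; the value 40 is an arbitrary choice for illustration.

```r
s <- seq(1, 50)

old <- options(width = 40)  # narrow the console so lines wrap sooner
print(s)
options(old)                # restore the previous width

# The bracketed number is the position of the first element on that line,
# so the element at position 19 can be looked up directly.
print(s[19])
```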
Above we saw that we can view the actual data stored in a variable by typing the name of the object (weather) at the command prompt. Generally this will print too many lines (although only 366 in the case of the weather dataset). A useful pair of functions for inspecting our data are head and tail. By default these list just the top and bottom 6 observations (or rows of data) from the data frame, in the order in which they appear there. Here we use a second argument to request just the top 2 rows and then the bottom 3 rows.
> head(weather, 2)
        Date Location MinTemp MaxTemp Rainfall Evaporation
1 2007-11-01 Canberra       8    24.3      0.0         3.4
2 2007-11-02 Canberra      14    26.9      3.6         4.4
  Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm
1      6.3          NW            30         SW         NW
2      9.7         ENE            39          E          W
  WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
1            6           20          68          29
2            4           17          80          36
  Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
1      1019.7      1015.0        7        7    14.4    23.6
2      1012.4      1008.4        5        3    17.5    25.7
  RainToday RISK_MM RainTomorrow
1        No     3.6          Yes
2       Yes     3.6          Yes
> tail(weather, 3)
          Date Location MinTemp MaxTemp Rainfall
364 2008-10-29 Canberra    12.5    19.9        0
365 2008-10-30 Canberra    12.5    26.9        0
366 2008-10-31 Canberra    12.3    30.2        0
    Evaporation Sunshine WindGustDir WindGustSpeed
364         8.4      5.3         ESE            43
365         5.0      7.1          NW            46
366         6.0     12.6          NW            78
    WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm
364        ENE        ENE           11            9
365        SSW        WNW            6           28
366         NW        WNW           31           35
    Humidity9am Humidity3pm Pressure9am Pressure3pm
364          63          47      1024.0      1022.8
365          69          39      1021.0      1016.2
366          43          13      1009.6      1009.2
    Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM
364        3        2    14.5    18.3        No       0
365        6        7    15.8    25.9        No       0
366        1        1    23.8    28.6        No       0
    RainTomorrow
364           No
365           No
366           No
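The same calls can be tried on any data frame. This sketch assumes the built-in mtcars data frame so that it runs without the rattle package, and it also checks the default of 6 rows.

```r
top <- head(mtcars, 2)      # first 2 rows
bottom <- tail(mtcars, 3)   # last 3 rows
print(top)
print(bottom)

# With no second argument, both functions return 6 rows.
print(nrow(head(mtcars)))
print(nrow(tail(mtcars)))
```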
The weather dataset is more complex than the simple vectors we have seen above. In fact the weather dataset is a special kind of list called a data frame, which is one of the most common data structures in R for storing our datasets. A data frame is essentially a list of columns. The weather dataset has 24 columns. For a data frame each column is a vector, each of the same length.
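We can check the claim that a data frame is a list of equal-length columns on a tiny example. The data frame obs below is made up for illustration; it only borrows a few column names from weather.

```r
# A tiny made-up data frame with a few weather-like columns.
obs <- data.frame(
  Location  = c("Canberra", "Canberra", "Canberra"),
  MinTemp   = c(8.0, 14.0, 13.7),
  RainToday = c("No", "Yes", "Yes")
)

print(is.list(obs))          # a data frame is a kind of list
print(length(obs))           # its length is the number of columns
print(sapply(obs, length))   # every column has the same length
print(obs$MinTemp)           # one column, extracted as a vector
```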
If we only want to review certain rows or columns of the data frame we can index the dataset name. Indexing simply uses square brackets to list the row numbers and column numbers that are of interest to us:
> weather[4:8, 2:4]
  Location MinTemp MaxTemp
4 Canberra    13.3    15.5
5 Canberra     7.6    16.1
6 Canberra     6.2    16.9
7 Canberra     6.1    18.2
8 Canberra     8.3    17.0
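A few variations on indexing are sketched below, assuming the built-in mtcars data frame rather than weather. Leaving one side of the comma empty, and selecting columns by name, are standard R behaviour not covered in the text above.

```r
sub <- mtcars[4:8, 2:4]   # rows 4 to 8, columns 2 to 4
print(sub)

# Leaving one side of the comma empty selects everything on that side.
print(dim(mtcars[4:8, ]))   # all columns
print(dim(mtcars[, 2:4]))   # all rows

# Columns can also be selected by name.
print(mtcars[4:8, c("cyl", "disp", "hp")])
```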
Notice the notation for a sequence of numbers. The string 4:8 is actually equivalent to a call to the function seq with two arguments, 4 and 8. The function returns a vector containing the integers from 4 to 8. It gives the same values as listing them all in a call to the function c (combine):
> 4:8
[1] 4 5 6 7 8
> seq(4, 8)
[1] 4 5 6 7 8
> c(4, 5, 6, 7, 8)
[1] 4 5 6 7 8
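The three forms print identically, and for everyday use we can treat them as interchangeable. One small difference, sketched here with the base R functions typeof and identical, is how the numbers are stored.

```r
a <- 4:8
b <- seq(4, 8)
d <- c(4, 5, 6, 7, 8)

print(all(a == b))   # same values
print(all(a == d))   # same values

# One difference worth knowing: 4:8 is stored as integers, whereas
# c(4, 5, 6, 7, 8) is stored as doubles, so identical() tells them apart.
print(typeof(a))
print(typeof(d))
print(identical(a, d))
```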
Before we finish our basic introduction to the R command line it is important to know how to learn more. From the command line we obtain help on functions by calling the help function:
> help(dim)
A shorthand is to precede the function name with a ? as in: ?dim. This is automatically converted into a call to the help function.
The help.search function will search the documentation to list functions that may be of relevance to the topic we supply as an argument:
> help.search("dimensions")
The shorthand here is to precede the string with two question marks as in ??dimensions.
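When we only half remember a function's name, two further functions can help without opening a help page. They are not mentioned above but are part of a standard R installation: apropos lists object names matching a pattern, and args shows the arguments a function expects.

```r
# apropos() lists the names of available objects that match a pattern.
hits <- apropos("^dim")
print(hits)

# args() shows the arguments a function expects, without opening a help page.
print(args(dim))
```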
A third useful command for searching for help on a topic is the function RSiteSearch. This will submit a query to the R project's search engine on the Internet:
> RSiteSearch("dimensions")
Finally, recall that to exit from R, as we saw in Section 2.1, we issue the q command:
> q()
Our first session with R is now complete. The command line, as we have introduced it here, is where we access the full power of R. But not everyone wants to learn and remember commands, and so Rattle will get us started quite quickly with data mining, with only our minimal knowledge of the command line.
Copyright © Togaware Pty Ltd