DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
Pivot Tables |
The reshape package was inspired by pivot tables.
The basic principle to understand in using reshape is the two types of variables in our data. One collection are the identifier variables and the other are the ``measures''. The melt function will reorganise the data so that each identifier variable is a single column (where they may have previously been a mixture of rows and columns).
The aim is to generate various reorganisations and aggregate summaries
of the data. For example, with the wine dataset we may
like to look at the average values of a number of input variables for
each Type. The first step, using the reshape
package, is
to melt the data frame, which expands the non-identifying
variables across the identifying variables:
> wine.molten <- melt(wine, id="Type") > head(wine.molten) Type variable value 1 1 Alcohol 14.23 2 1 Alcohol 13.20 3 1 Alcohol 13.16 4 1 Alcohol 14.37 5 1 Alcohol 13.24 6 1 Alcohol 14.20 > tail(wine.molten) Type variable value 17312 3 Proline 660 17412 3 Proline 740 17512 3 Proline 750 17612 3 Proline 835 17712 3 Proline 840 17812 3 Proline 560 |
> cast(wine.molten, Type ~ variable, mean) Type Alcalinity Alcohol Ash Color Dilution Flavanoids Hue 1 17.03729 13.74475 2.455593 5.528305 3.157797 2.9823729 1.0620339 2 20.23803 12.27873 2.244789 3.086620 2.785352 2.0808451 1.0562817 3 21.41667 13.15375 2.437083 7.396250 1.683542 0.7814583 0.6827083 Magnesium Malic Nonflavanoids Phenols Proanthocyanins Proline 106.3390 2.010678 0.290000 2.840169 1.899322 1115.7119 94.5493 1.932676 0.363662 2.258873 1.630282 519.5070 99.3125 3.333750 0.447500 1.678750 1.153542 629.8958 |
We can also include the column and row totals. We will illustrate this
with a subset of the wine dataset:
> measure <- c("Alcohol", "Malic", "Ash") > wine.molten <- melt(wine, id="Type", measure=measure) > cast(wine.molten, Type ~ variable, mean, margins=c("grand_row", "grand_col")) Type Alcohol Malic Ash . 1 13.74475 2.010678 2.455593 6.070339 2 12.27873 1.932676 2.244789 5.485399 3 13.15375 3.333750 2.437083 6.308194 . 13.00062 2.336348 2.366517 5.901161 |
> aggregate(wine[,-1], list(Type=wine$Type), mean) Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 Nonflavanoids Proanthocyanins Color Hue Dilution Proline 1 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119 2 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070 3 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958 |
Another example using reshape.
> dat <- read.table("clipboard", header=TRUE) > dat Q S C 1 1 A 5 2 1 B 10 3 1 C 50 4 1 D 10 5 2 A 20 6 2 E 10 7 2 C 40 8 3 D 5 9 3 F 1 10 3 G 5 11 3 B 75 > res <- reshape(dat, direction = "wide", idvar = "Q", timevar = "S") > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 NA NA NA 5 2 20 NA 40 NA 10 NA NA 8 3 NA 75 NA 5 NA 1 5 > res[is.na(res)] <- 0 > names( > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 0 0 0 5 2 20 0 40 0 10 0 0 8 3 0 75 0 5 0 1 5 |
Or the same, but using the reshape package:
> library(reshape) > datm <- melt(dat, id=1:2) > cast(datm, Q ~ S) S A B C D E F G Q A B C D E F G 1 5 10 50 10 NA NA NA 2 20 NA 40 NA 10 NA NA 3 NA 75 NA 5 NA 1 5 |
With the basics in hand we can now explore the data in a more graphical fashion, beginning with plots that help understand individual variables (barplot, piechart, and line plots), followed by a number of plots that explore relationships between variables (scatterplot and correlation plot).
Copyright © Togaware Pty Ltd Support further development through the purchase of the PDF version of the book.