Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Archetype Analysis

See http://www.jstatsoft.org/v30/i08/paper

Archetype analysis can only handle numeric data. We use the weather dataset as an example, extracting just the numeric columns and making sure they really are numeric. Todo: Fix generation of the weather data, as these columns come out as character. For the first part we mimic the paper with a two-variable dataset.



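> library(archetypes)
> library(rattle)      # Source of the weather dataset (newer releases ship it in rattle.data).
> vars <- c("MinTemp", "MaxTemp")
> ds <- na.omit(apply(weather[vars], 2, as.numeric))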
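
The conversion step can be tried in isolation with base R. The following is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, standing in for weather columns that arrive as character). It shows that apply() with as.numeric yields a numeric matrix and that na.omit() records the dropped rows in the "na.action" attribute.

```r
# Hypothetical stand-in for the weather data: numbers stored as character.
toy <- data.frame(MinTemp = c("8.0", "14.0", NA, "13.3"),
                  MaxTemp = c("24.3", "26.9", "23.4", "15.5"),
                  stringsAsFactors = FALSE)

# apply() turns the data frame into a character matrix and converts
# each column to numeric.
num <- apply(toy, 2, as.numeric)

# na.omit() drops incomplete rows and remembers which ones it dropped.
clean <- na.omit(num)
dropped <- attr(clean, "na.action")

str(clean)
print(as.integer(dropped))
```

The row indices in dropped are what is needed later to line up other columns of the original data with the rows that were kept.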

Now build the archetypes. We don't know how many we might want, but start with 4 to illustrate.



> set.seed(42)
> a <- archetypes(ds, 4)
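
As a rough illustration of what the fitted model represents: in the formulation of the paper linked above, each observation is approximated by a convex combination of the archetypes (non-negative weights summing to one). The sketch below uses base R only and made-up numbers; Z, alpha and X are hypothetical names and values, not output of archetypes().

```r
# Hypothetical archetypes, one per row, in two dimensions.
Z <- rbind(c(0, 0), c(10, 0), c(0, 10))

# Hypothetical convex weights for two observations:
# non-negative, and each row sums to 1.
alpha <- rbind(c(0.5, 0.5, 0.0),
               c(0.2, 0.3, 0.5))

# Each observation is reconstructed as a weighted mix of the archetypes.
X_hat <- alpha %*% Z

# Made-up observed values and the squared reconstruction error.
X <- rbind(c(5, 1), c(3, 5))
err <- sum((X - X_hat)^2)

print(X_hat)
print(err)
```

Fitting archetypes amounts to searching for archetypes and weights that keep this reconstruction error small, which is why the number of archetypes has to be chosen up front.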

Now let's explore them with two plots. Todo: Get the two plots and display them in a figure.



> atypes(a)
> par(mfrow=c(1,1))
> plot(a, ds, chull=chull(ds), cex=0.6)
> plot(a, ds, adata.show=TRUE, cex=0.6)
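
The chull argument above is fed with chull(ds), the base R convex hull. A minimal sketch of what that call returns, on made-up two-dimensional data (pts is a hypothetical name): the hull is reported as row indices, and archetypes are expected to sit on or near that boundary, which is why drawing the hull helps when reading the plot.

```r
set.seed(42)

# Hypothetical 2-d data standing in for ds.
pts <- cbind(x = rnorm(50), y = rnorm(50))

# chull() returns the row indices of the hull vertices, in clockwise order.
hull <- chull(pts)

hull_pts <- pts[hull, ]    # observations on the hull
interior <- pts[-hull, ]   # everything strictly inside or on an edge

print(hull)
print(nrow(interior))
```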

Todo: Split out and comment on the following



> ahistory(a, step=0)
> movieplot(a, ds)
> # Avoid local minima by repeating the fit several times (nrep).
> 
> set.seed(1960)
> a4 <- stepArchetypes(data=ds, k=4, verbose=FALSE, nrep=4)
> summary(a4)
> plot(a4, ds)
> bestModel(a4)
> # What is the best number of archetypes? Iterate over k.
> 
> set.seed(1960)
> as <- stepArchetypes(data=ds, k=1:10, verbose=FALSE, nrep=4)
> # Have a look at the residual sum of squares (this could also be used
> # below with the whole data, where there are some warnings).
> 
> rss(as)
> # Look at the iterations. For any that are 1 we might expect warnings
> # and an rss of NA, caused by problems with the initial random starts.
> # We don't have any here.
> 
> iters(as)
> # Now look at the "elbow criterion" for the best number of archetypes: 4 or 7.
> 
> screeplot(as)
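
The elbow is read off the scree plot by eye. As a rough numerical cross-check, one simple heuristic (an assumption of this sketch, not something the archetypes package does) is to look for the largest second difference in the residual sum of squares. The values in rss_k are made up for illustration.

```r
# Hypothetical RSS values for k = 1..8 (not taken from the weather data).
rss_k <- c(0.120, 0.090, 0.060, 0.030, 0.027, 0.025, 0.024, 0.023)

# Second differences measure how sharply the curve bends.
# Entry i corresponds to k = i + 1.
bend <- diff(rss_k, differences = 2)

# The k with the sharpest bend is a candidate elbow.
elbow <- which.max(bend) + 1
print(elbow)
```

A heuristic like this only suggests a candidate; when two elbows look plausible, as with 4 and 7 here, the plot and the interpretability of the archetypes still decide.
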
> # We plotted 4 above, so let's look at 7
> 
> a7 <- bestModel(as[[7]])
> plot(a7, ds, chull=chull(ds))
> # Now do it with multiple numeric columns
> 
> numcol <- c(2:6,8,11:20)
> ds <- na.omit(apply(weather[numcol], 2, as.numeric))
> omitted <- attr(ds, "na.action")
> # Let's have a look at parallel coordinates - no obvious number of prototypes.
> 
> pcplot(ds)
> # Experiment
> 
> set.seed(1960)
> as <- stepArchetypes(ds, k=1:15, verbose=FALSE, nrep=3)
> # Now look for elbows - maybe 4 or 8. Let's go with 4 - the simpler number.
> 
> screeplot(as)
> a4 <- bestModel(as[[4]])
> # Display the archetypes (transposed to read better).
> 
> t(atypes(a4))
> barplot(a4, ds, percentage=TRUE) # Fails
> library(colorspace)  # Provides rainbow_hcl().
> pcplot(a4, ds, data.col=rainbow_hcl(2)[as.numeric(weather$RainTomorrow[-omitted])])
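
The colouring in the last call relies on omitted to drop the same rows from RainTomorrow that na.omit() dropped from ds. A minimal base R sketch of that alignment on made-up data (m and label are hypothetical). It also guards against the case where no rows were dropped: the "na.action" attribute is then NULL, and indexing with -NULL would return an empty vector.

```r
# Hypothetical numeric matrix with missing values in rows 2 and 3.
m <- cbind(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))

# Hypothetical class label, one per original row.
label <- factor(c("No", "Yes", "No", "Yes"))

clean <- na.omit(m)
omitted <- attr(clean, "na.action")

# Keep only the labels of the rows that survived na.omit().
kept <- if (is.null(omitted)) label else label[-omitted]

print(as.integer(omitted))
print(kept)
```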



Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010