
DATA MINING
Desktop Survival Guide
by Graham Williams

Removing Duplicates

The function duplicated returns a logical vector that flags each element of a data structure that repeats an earlier element:

> x <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
> duplicated(x)

[1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE

> x <- x[!duplicated(x)]
> x

[1] 1 2 3

This is a simple example, but the same approach works just as well to remove duplicated rows from a matrix or data frame.
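As a minimal sketch of the row-wise case, the following applies duplicated to a small data frame and a small matrix. The objects df and m and their contents are made up for illustration and do not come from the book.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(id  = c(1, 1, 2, 2, 3),
                 grp = c("a", "a", "b", "c", "c"))

# For a data frame, duplicated() compares whole rows, so only the
# second row (an exact repeat of the first) is flagged.
dup <- duplicated(df)
df.unique <- df[!dup, ]

# Hypothetical matrix; duplicated() compares its rows by default.
m <- matrix(c(1, 2,
              1, 2,
              3, 4), ncol = 2, byrow = TRUE)
m.unique <- m[!duplicated(m), , drop = FALSE]
```

Rows that differ in any column are kept, which is why the rows with id 2 both survive here. The function unique offers a shorthand for the same row-wise filtering.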

Suppose we have loaded the audit dataset into Rattle and, for whatever reason, want to remove rows with a duplicated value of Hours, keeping just the first row for each value. This step is performed outside Rattle, so we then need Rattle to reset its view of the data through a click of the Execute button. The resulting smaller dataset is shown in Figure 23.11.

> crs$dataset <- crs$dataset[!duplicated(crs$dataset$Hours),]

**Figure 23.11:** The dataset has been modified externally to remove specific rows and then the Execute button clicked so that Rattle will notice the changes.
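To see which rows survive when filtering on a single column, here is a small sketch that runs without Rattle. The data frame ds is a hypothetical stand-in for crs$dataset, and its ID column and values are assumptions made only for this illustration.

```r
# Hypothetical stand-in for crs$dataset; values are invented.
ds <- data.frame(ID    = 1:6,
                 Hours = c(40, 38, 40, 45, 38, 40))

# Keep the first row encountered for each distinct value of Hours.
first <- ds[!duplicated(ds$Hours), ]

# fromLast = TRUE scans from the end, so the last row for each
# distinct value of Hours is kept instead.
last <- ds[!duplicated(ds$Hours, fromLast = TRUE), ]
```

With these values, first retains the rows with ID 1, 2 and 4, while last retains the rows with ID 4, 5 and 6. Which occurrence to keep depends on the ordering of the data, so it can be worth sorting the dataset before removing duplicates.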

Brought to you by Togaware. This page generated: Sunday, 22 August 2010