DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
Using SQLite |
SQLite (from www.sqlite.org) is an open source database package that is well supported in R. It has the advantage that it requires no setup or administration (other than installing the package) and is an embedded system so that there is less of a connection overhead. You are able to manage very large datasets in SQLite without needing to load all the data into memory, unlike R itself, so that you are able to manipulate the data using SQL then load in just the data you need.
For small dataset SQLite is a good choice, but for very large datasets, MySQL still performs very well.
A Google Summer of Code project, created a package that stores data frames and matrices into sqlite databases, called sqlite data frames (sdf). These sdf's behave like ordinary data frames so that existing R functions will work. This enables R users to work with very large datasets much more readily, with no user effort.
As an example, first create an empty SQLite database (outside of R)
and import a CSV (comma separated value) file, telling sqlite to use
commas, not '|':
$ sqlite3 -separator , audit.db sqlite> create table audit(ID INTEGER, Age INTEGER, Employment TEXT, Education TEXT, Marital TEXT, Occupation TEXT, Income REAL, Sex TEXT, Deductions REAL, Hours INTEGER, Accounts TEXT, Adjustment REAL, Adjusted INTEGER); sqlite> .tables audit sqlite> .import audit.csv audit sqlite> select count(*) from audit; 2001 sqlite> delete from audit where ID='"ID"'; sqlite> select count(*) from audit; 2000 sqlite> .quit |
Now, using SQLiteDF we can treat the data as a Data Frame, avoiding
loading it into R (except when it is copied internally in an R
command):
> library(SQLiteDF) > audit <- sdfImportSQLite("audit.db", "audit") > audit SQLite data frame "audit" (2000 row(s) by 13 column(s)) stored in file ".SQLiteDF/audit.db" First 6 rows: ID Age Employment Education Marital Occupation Income Sex 1 1004641 38 "Private" "College" "Unmarried" "Service" 81838.00 "Female" 2 1010229 35 "Private" "Associate" "Absent" "Transport" 72099.00 "Male" 3 1024587 32 "Private" "HSgrad" "Divorced" "Clerical" 154676.74 "Male" 4 1038288 45 "Private" "Bachelor" "Married" "Repair" 27743.82 "Male" 5 1044221 60 "Private" "College" "Married" "Executive" 7568.23 "Male" 6 1047095 74 "Private" "HSgrad" "Married" "Service" 33144.40 "Male" Deductions Hours Accounts Adjustment Adjusted 1 0 72 "UnitedStates" 0 0 2 0 30 "Jamaica" 0 0 3 0 40 "UnitedStates" 0 0 4 0 55 "UnitedStates" 7298 1 5 0 40 "UnitedStates" 15024 1 6 0 30 "UnitedStates" 0 0 ... > sum(audit$Income) [1] 169376920 |
The SQLite package, as an alternative, allows the easy import and export of data to text files.
library(RSQLite) con <- dbConnect(SQLite(), "foo3.db") dbGetQuery(con, "pragma cache_size") cache_size 1 2000 dbGetQuery(con, "pragma cache_size=2500") NULL dbGetQuery(con, "pragma cache_size") cache_size 1 2500 |
Now in R in the same directory:
library(DBI) library(RSQLite) driver<-dbDriver("SQLite") connect<-dbConnect(driver, dbname = "audit.db") dbWriteTable(connect, "audit", audit, overwrite = T, row.names = F) dbListTables(connect) [1] "audit" query01 <- dbSendQuery(connect, statement = "select * from audit"); data01 <- fetch(query01, n = 10) contents(data01) sqliteCloseResult(query01) sqliteCloseConnection(connect) sqliteCloseDriver(driver) |
Copyright © Togaware Pty Ltd Support further development through the purchase of the PDF version of the book.