DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
Database Connection |
The basic usage of RODBC will connect to a known ODBC object using the odbcConnect function and query the database for the tables it exports using sqlTables:
> library(RODBC) > channel <- odbcConnect("DWH") # This may pop up a window to enter username and password > tables <- sqlTables(channel) > columns <- sqlColumns(channel, "clients") |
You can then retrieve the full contents of a table with
sqlFetch:
> ds <- sqlFetch(channel, "tablename") |
> ds <- sqlQuery(channel, "SELECT * FROM clients WHERE age > 35") |
Some ODBC drivers, such as the Netezza ODBC driver, have a pre-fetch option that interacts poorly with applications connecting through the driver. With a pre-fetch option the driver appears to report fewer rows being available than actually available. It seems that the number of rows reported is in fact the pre-fetch limited number of rows. For the Netezza ODBC driver, for example, the default is 256 rows. This confuses the application connecting to ODBC (in this case, R through the RODBC). The symptom is that we only receive 256 rows from the table. Internally, the application is probably using either the SQLExtendedFetch or SQLFetchScroll ODBC functions.
A further option indicates whether to believe the results of a query that reports on the number of rows in a table (some ODBC drivers behave differently in this respect). This needs to be set for some ODBC drivers. The Believe Num Rows option is an oddity required for some ODBC drivers and appears to be associated with the pre-fetch behaviour of these drivers. The default is to activate the check box (i.e., Believe Num Rows is True). However, if you find that you are not retrieveing all rows from the source table, the the ODBC driver may be using a pre-fetch mechanism that does not ``correctly'' report the number of rows (it is probably only reporting the number of rows limited to the size of the pre-fetch). In these cases deactivate the Believe Num Rows check box.
There are a number of solutions to this issue. One from the applications side is to set Roption[]believeNRows to FALSE. This will then retrieve all the rows from the table. Another solution is at the driver configuration level. For example, in connecting through the Netezza ODBC driver a configuration option is available where you can change the default Prefetch Count value.
An example of the issue is illustrated below:
> channel <- odbcConnect("netezza") > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 500") > dim(orders) [1] 256 9 > orders <- sqlQuery(channel, "select * from orders limit 500", believeNRows=FALSE) > dim(orders) [1] 500 9 > odbcCloseAll() |
We can reopen the driver and in the resulting GUI configuration set the Prefetch Count to perhaps 10,000. Then:
> channel <- odbcConnect("netezza") > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 500") > dim(orders) [1] 500 9 > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 50000") > dim(orders) [1] 10000 9 > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 50000", believeNRows=FALSE) > dim(orders) [1] 50000 9 > |
Note that we would not want to default beleiveNRows to FALSE since, for example, with a Teradata query this increase the query time by some 3 times!
For an SQLite database, edit .odbc.ini
[audit] Description=SQLite test audit database for Rattle Driver=SQLite3 Database=/home/kayon/Projects/rattle/audit.db # optional lock timeout in milliseconds Timeout=2000 |
Copyright © Togaware Pty Ltd Support further development through the purchase of the PDF version of the book.