DATA MINING
Desktop Survival Guide by Graham Williams
Data is the starting point for all data mining--without it there is nothing to mine. In today's world there is certainly no shortage of data, but turning that data into information, knowledge, and eventually, perhaps, wisdom is not a simple matter.
Whilst data abounds in our modern era, we still need to scout around to obtain the data we need. Many of today's organisations maintain massive warehouses of data. This provides fertile ground for sourcing data, but also an extensive headache for us in navigating such a massive landscape.
An early step in a data mining project is to bring all the required data together. This seemingly simple task can be a significant burden on the budgeted resources for data mining, perhaps consuming up to 70% of the elapsed time of a project. It should not be underestimated.
In bringing data together a number of issues need to be considered. These include the provenance (source and purpose) and quality (accuracy and reliability) of the data. Data collected for different purposes may well store different information in confusingly similar ways. Also, some data requires appropriate permissions for its use, and the privacy of anyone the data relates to needs to be considered. Time spent at this stage getting to know your data will be time well spent.
In this chapter we introduce data, starting with the language we use to describe and talk about data. A number of sample datasets will be introduced and we'll see how we can obtain and manage these datasets. We use these sample datasets, and the various manipulations we make to them, to introduce R as a language for manipulating data.