DATA MINING
Desktop Survival Guide
by Graham Williams
Beyond Rattle: R for the Data Miner
Subsections
R: The Language
Evaluation
Exercises
Assignment
Libraries and Packages
Searching for Objects
Package Management
Information About a Package
Testing Package Availability
Packages and Namespaces
Basic Programming in R
Principles
Folders and Files
Flow Control
If Statement
For Loop
Functions
Apply
Methods
Objects
System
Running System Commands
System Parameters
Misc
Internet
Memory Management
Memory Usage
Garbage Collection
Errors
Frivolous
Sudoku
Further Resources
Using R
Specific Purposes
Survey Analysis
Getting Help
R Documentation
Data
Data Types
Numbers
Strings
Building Strings
Splitting Strings
Substitution
Trim Whitespace
Evaluating Strings
Logical
Dates and Times
Space
Data Structures
Vectors
Arrays
Lists
Sets
Matrices
Exercises
Data Frames
Accessing Columns
Removing Columns
Exercises
General Manipulation
Factors
Elements
Rows and Columns
Finding Index of Elements
Partitions
Head and Tail
Reverse a List
Sorting
Unique Values
Loading Data
Interactive Responses
Interactive Data Entry
Available Datasets
The Iris Dataset
CSV Data Used In The Book
The Wine Dataset
The Cardiac Arrhythmia Dataset
The Adult Survey Dataset
Foreign Formats
Stata Data
Conversions
Reading Variable Width Data
Saving Data
Formatted Output
Automatically Generate Filenames
Reading a Large File
Manipulating Data
Manipulating Data As SQL
Using SQLite
ODBC Data
Database Connection
Excel
Access
Clipboard Data
Spatial Data
Simple Map
A Density Map
Overlays and Point in Polygon
Other Data Formats
Fixed Width Data
Global Positioning System
Documenting a Dataset
Common Data Problems
Graphics in R
Basic Plot
Controlling Axes
Arrow Axes
Legends and Points
Tables Within Plots
Colour
Labels in Plots
Axis Labels
Legend
Labels Within Plots
Maths in Labels
Multiple Plots
MatPlot
Multiple Plots Using ggplot2
Using GGPlot
Networks
Symbols
Other Graphic Elements
Making an Animation
Animated Mandelbrot
Adding a Logo to a Graphic
Graphics Devices Setup
Screen Devices
Multiple Devices
File Devices
Multiple Plots
Copy and Print Devices
Graphics Parameters
Plotting Region
Locating Points on a Plot
Scientific Notation and Plots
Understanding Data
Single Variable Overviews
Textual Summaries
Multiple Line Plots
Separate Line Plots
Pie Chart
Fan Plot
Stem and Leaf Plots
Histogram
Barplot
Trellis Histogram
Histogram Uneven Distribution
Bump Chart
Density Plot
Basic Histogram
Basic Histogram with Density Curve
Practical Histogram
Multiple Variable Overviews
Scatterplot
Scatterplot with Marginal Histograms
Multi-Dimension Scatterplot
Correlation Plot
Colourful Correlations
Projection Pursuit
RADVIZ
Parallel Coordinates
Categoric and Numeric
Measuring Data Distributions
Textual Summaries
Boxplot
Multiple Boxplots
Boxplot by Class
Tuning a Boxplot
Boxplot Using Lattice
Boxplot Using ggplot
Violin Plot
What Distribution
Labelling Outliers
Miscellaneous Plots
Line and Point Plots
Matrix Data
Multiple Plots
Aligned Plots
Probability Scale
Network Plot
Sunflower Plot
Stairs Plot
Graphing Means and Error Bars
Bar Charts With Segments
Bar Plot With Means
3d Bar Plot
Stacks Versus Lines
Multi-Line Title
Mathematics
Plots for Normality
Basic Bar Chart
Bar Chart Displays
Multiple Dot Plots
Alternative Multiple Dot Plots
3D Plot
Box and Whisker Plot
Box and Whisker Plot: With Means
Clustered Box Plot
Perspective Plots
Star Plot
Residuals Plot
Waterfall Plots
Dates and Times
Simple Time Series
Multiple Time Series
Plot Time Series
Plot Time Series with Axis Labels
Grouping Time Series for Box Plot
Time Series Heatmap
Textual Summaries
Stem and Leaf Plots
Histogram
Barplot
Density Plot
Basic Histogram
Basic Histogram with Density Curve
Practical Histogram
Correlation Plot
Colourful Correlations
Measuring Data Distributions
Textual Summaries
Boxplot
Multiple Boxplots
Boxplot by Class
Box and Whisker Plot
Box and Whisker Plot: With Means
Clustered Box Plot
Further Resources
Map Displays
Further Resources
Preparing Data
Data Selection and Extraction
Training and Test Datasets
Data Cleaning
Review Data
Selectively Changing Vector Values
Replace Indices By Names
Missing Values
Remove Levels from a Factor
Removing Outliers
Variable Manipulations
Remove Columns
Reorder Columns
Remove Non-Numeric Columns
Remove Variables with no Variance
Cleaning the Wine Dataset
Cleaning the Cardiac Dataset
Cleaning the Survey Dataset
Imputation
Nearest Neighbours
Multiple Imputation
Data Linking
Simple Linking
Record Linkage
Data Transformation
Aggregation
Sum of Columns
Pivot Tables
Normalising Data
Binning
Interpolation
Outlier Detection
Variable Selection
Descriptive and Predictive Analytics
Building a Model
Cluster Analysis: K-Means
Association Analysis: Apriori
Classification: Decision Trees
Classification: Boosting
Classification: Random Forests
Issues
Incremental or Online Modelling
Model Tuning
Tuning rpart
Unbalanced Classification
Building Models
Outlier Analysis
Temporal Analysis
Evaluation
Basics
Basic Measures
Cross Validation
Graphical Performance Measures
Lift
The ROC Curve
Other Examples
10 Fold Cross Validation
Area Under Curve
Calibration Curves
Reporting
Generating Open Document Format
Getting Started with odfWeave
OpenOffice.org Macro Support
Generating HTML
Generating PDF with LaTeX
Configuration
Figure Sizes
Copyright © Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Saturday, 16 January 2010