My new book is now available from Amazon.

From the cover:

The Essentials of Data Science: Knowledge Discovery Using R presents the concepts of data science through a hands-on approach using free and open source software. It systematically drives an accessible journey through data analysis and machine learning to discover and share knowledge from data.

Building on over thirty years’ experience in teaching and practising data science, the author encourages a programming-by-example approach to ensure students and practitioners attune to the practise of data science while building their data skills. Proven frameworks are provided as reusable templates. Real world case studies then provide insight for the data scientist to swiftly adapt the templates to new tasks and datasets.

The book begins by introducing data science. It then reviews R’s capabilities for analysing data by writing computer programs. These programs are developed and explained step by step. From analysing and visualising data, the framework moves on to tried and tested machine learning techniques for predictive modelling and knowledge discovery. Literate programming and a consistent style are a focus throughout the book.

Data Scientists have access to a grammar for preparing data (Hadley Wickham’s tidyr package in R), a grammar for data wrangling (dplyr), and a grammar for graphics (ggplot2).

At an R event hosted by CSIRO in Canberra in 2011 Hadley  noted that we are missing a grammar for machine learning. At the time I doodled some ideas but never developed. I repeat those doodles here. The idea’s are really just that – ideas as a starting point. Experimental code is implemented in the graml package for R which is refining the concepts first explored in the experimental containers package.

A grammar of machine learning can follow the ggplot2 concept of building layer upon layer to define the final model that we build. I like this concept rather than the concept of a data flow for model building. With a data flow a dataset is piped (in R using magrittr’s %>% operator) from one data wrangling step to the next data wrangling step. Hadley’s tidyr and dplyr do this really well.

The concept of a grammar of machine learning begins with recognising that we want to train a model:

train(ds, form(target ~ .))

Simply we want to train a model using some dataset ds where one of the columns of the dataset is named target and we expect to model this variable based on the other variables within the dataset (signified by the ~ .).

Generally in machine learning and statistical model building we split our dataset into a training dataset, a validation dataset, and a testing dataset. Some use only two datasets. Let’s add this in as the next “layer” for our model build.

train(ds, form(target ~ .)) +
  dataPartition(0.7, 0.3)

That is, we ask for 70% of the data randomly sampled to train the model.

We will have already performed our data preparation steps and let’s say that we know in ds the target variable has only two distinct values, yes and no. Thus a binary classification model is called for.

In R we have a tremendous variety of model building algorithms that support binary classification. My favourite has been randomForest so let’s add in our request to train a model using randomForest().

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +

Now we might want to do a parameter sweep over the mtry parameter to the randomForest() function which is the number of variables to randomly sample as we build each decision tree.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1))

Finally to report on the evaluation of the model using the area under the curve (AUC).

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1)) +

The object returned is a trained model incorporating the additional information requested. Other operations can be performed on this model object, including its deployment into a production system!

We can provide parameters as a lightweight layer above other model building packages with no or minimal effort required to move to a new model builder.


(Image from

Graham @ Microsoft