Rattle, an open source GUI for Data Science and Machine Learning using R, has been updated to Version 5.1 on CRAN and is available for download now. As always, the latest updates to Rattle are available from Bitbucket.

A small but important update means that Rattle now works with the latest version of RGtk2 on CRAN.

A significant update is that the default boosting algorithm in Rattle is now xgboost, currently the most popular ensemble tree builder and a regular star performer on Kaggle. Rattle allows you to quickly get up to speed with xgboost through the GUI and the automatically generated R code template.
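
For those new to xgboost, the code Rattle generates is standard xgboost usage. The following is only an illustrative sketch (not the exact template Rattle produces), using the built-in mtcars data as a stand-in for a real dataset:

library(xgboost)

# Illustrative data: predictors as a numeric matrix, target as 0/1.
x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
y <- mtcars$am

# Fit a small boosted tree model for binary classification.
model <- xgboost(data = x, label = y, nrounds = 50,
                 objective = "binary:logistic", verbose = 0)

# Predicted probabilities for the training rows.
pred <- predict(model, x)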

Also with this release Rattle supports ggraptR for interactive generation of ggplot2 graphics. This requires the latest version of ggraptR, available from GitHub.

The Log tab continues to evolve, producing better R template code that is well written and documented, using tidyverse functionality. It aims to follow the guidelines in my recent template-oriented book on data science in R, Essentials of Data Science.

I have also made a Docker image available so that anyone can run Rattle without any installation required (except for installing Docker and loading up a copy of the image). This is available from the Docker Hub where you can also find instructions for setting up Docker. Indeed, check out my recent blog post on three options for setting up Rattle, including the use of the cloud-based Data Science Virtual Machine on Azure.

To install the latest release of Rattle from CRAN:

> install.packages("rattle")
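
Then load the package and start the GUI:

> library(rattle)
> rattle()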

 

Data Scientists have access to a grammar for preparing data (Hadley Wickham’s tidyr package in R), a grammar for data wrangling (dplyr), and a grammar for graphics (ggplot2).
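
As a concrete (if trivial) illustration of how these grammars compose, here dplyr pipes the data into a ggplot2 plot that is then built up layer by layer (a minimal sketch using the built-in mtcars dataset):

library(dplyr)
library(ggplot2)

mtcars %>%
  filter(cyl %in% c(4, 6, 8)) %>%                 # wrangle the data with dplyr
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = wt, y = mpg, colour = cyl)) +    # then build the plot layer by layer
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)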

At an R event hosted by CSIRO in Canberra in 2011, Hadley noted that we are missing a grammar for machine learning. At the time I doodled some ideas but never developed them further. I repeat those doodles here. The ideas are really just that: ideas as a starting point. Experimental code is implemented in the graml package for R, which refines the concepts first explored in the experimental containers package.

A grammar of machine learning can follow the ggplot2 concept of building layer upon layer to define the final model that we build. I prefer this concept to that of a data flow for model building. With a data flow, a dataset is piped (in R using magrittr’s %>% operator) from one data wrangling step to the next. Hadley’s tidyr and dplyr do this really well.

The concept of a grammar of machine learning begins with recognising that we want to train a model:

train(ds, formula(target ~ .))

Simply put, we want to train a model using some dataset ds, where one of the columns of the dataset is named target and we expect to model this variable based on the other variables within the dataset (signified by the ~ .).

Generally in machine learning and statistical model building we split our dataset into a training dataset, a validation dataset, and a testing dataset. Some use only two datasets. Let’s add this in as the next “layer” for our model build.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3)

That is, we ask for a random 70% of the data to train the model, with the remaining 30% held out for validation.
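
Outside the grammar, that partition takes a couple of lines of plain R. A minimal sketch, assuming ds is a data frame:

set.seed(42)
n        <- nrow(ds)
train    <- sample(n, 0.7 * n)              # 70% of row indices for training
validate <- setdiff(seq_len(n), train)      # remaining 30% for validation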

We will have already performed our data preparation steps and let’s say that we know in ds the target variable has only two distinct values, yes and no. Thus a binary classification model is called for.

In R we have a tremendous variety of model building algorithms that support binary classification. My favourite has been randomForest so let’s add in our request to train a model using randomForest().

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest)

Now we might want to perform a parameter sweep over the mtry parameter to the randomForest() function, which is the number of variables randomly sampled as candidates at each split of each decision tree.

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1))
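
For comparison, without the grammar such a sweep is typically a simple loop over candidate mtry values, judging each model by its out-of-bag error. A minimal sketch, assuming the randomForest package, a factor target, and the train index from the partition above:

library(randomForest)

mtry_values <- seq(5, ncol(ds) - 1, 1)            # candidate values to sweep
oob_error   <- sapply(mtry_values, function(m) {
  rf <- randomForest(target ~ ., data = ds[train, ], mtry = m)
  tail(rf$err.rate[, "OOB"], 1)                   # final out-of-bag error
})
best_mtry <- mtry_values[which.min(oob_error)]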

Finally, we report on the evaluation of the model using the area under the ROC curve (AUC).

train(ds, formula(target ~ .)) +
  dataPartition(0.7, 0.3) +
  model(randomForest) +
  tuneSweep(mtry=seq(5, nvars, 1)) +
  evaluate_auc()
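
Again for comparison, that evaluation can be done today with, for example, the pROC package. A minimal sketch, assuming the train and validate indices and best_mtry from above, and a target with levels yes and no:

library(pROC)

rf      <- randomForest(target ~ ., data = ds[train, ], mtry = best_mtry)
probs   <- predict(rf, newdata = ds[validate, ], type = "prob")[, "yes"]
roc_obj <- roc(ds$target[validate], probs)
auc(roc_obj)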

The object returned is a trained model incorporating the additional information requested. Other operations can be performed on this model object, including its deployment into a production system!

We can pass parameters straight through to the underlying model builder, so the grammar remains a lightweight layer above other model building packages, with little or no effort required to move to a new model builder.

  model(randomForest::randomForest, 
        ntree=100, 
        mtry=4, 
        importance=TRUE,
        replace=FALSE, 
        na.action=randomForest::na.roughfix)


Graham @ Microsoft

Welcome to the new togaware.com site. After over a year living on togaware.net I’ve finally moved the test site over to the main togaware.com site and this is what you are now viewing!

You’ll find all of the Togaware resources here still (somewhere or other…). They are being migrated across to the new format bit by bit. If you can’t find something, do let me know (add a comment below). We’ll try and find it again!

Stay tuned for more blog posts and new activities linked with Togaware.