My new book is now available from Amazon.

From the cover:

The Essentials of Data Science: Knowledge Discovery Using R presents the concepts of data science through a hands-on approach using free and open source software. It systematically drives an accessible journey through data analysis and machine learning to discover and share knowledge from data.

Building on over thirty years’ experience in teaching and practising data science, the author encourages a programming-by-example approach to ensure students and practitioners attune to the practice of data science while building their data skills. Proven frameworks are provided as reusable templates. Real world case studies then provide insight for the data scientist to swiftly adapt the templates to new tasks and datasets.

The book begins by introducing data science. It then reviews R’s capabilities for analysing data by writing computer programs. These programs are developed and explained step by step. From analysing and visualising data, the framework moves on to tried and tested machine learning techniques for predictive modelling and knowledge discovery. Literate programming and a consistent style are a focus throughout the book.

The fully open source software stack of the Ubuntu Data Science Virtual Machine (DSVM) hosted on Azure is a great place to support an R workshop, laboratory session or R training. I record here the simple steps to set up a Linux Data Science Virtual Machine (in the main so I can remember how to do it each time). Workshop attendees will have their own laptop computers and can certainly install R themselves, but with the Ubuntu Data Science Virtual Machine we have a shared and uniformly configured platform which avoids the traditional idiosyncrasies and frustrations that plague a large class installing software on multiple platforms themselves. Instead of spending the first trouble-filled hour of a class setting up everyone’s computer we can use a local browser to access either Jupyter Notebooks or RStudio Server running on the DSVM.

Jupyter Notebooks on JupyterHub

We illustrate the session with both Jupyter Notebook, supporting multiple users under JupyterHub, and RStudio Server. Both are accessed through a browser. JupyterHub uses https (encrypted), which may be blocked by firewalls within some organisations. In those environments RStudio Server over http serves as a backup.

WARNING: Jupyter Notebook has been able to render my laptop computer (under both Linux and Windows, Firefox and IE) unusable after a period of extensive usage when the browser freezes and the machine becomes completely unresponsive.

Jupyter Notebook provides a browser interface with basic literate programming capability. I’ve been a fan of literate programming since my early days as a programmer in the 1980s when I first came across the concept from Donald Knuth. I now encourage literate data science and it is a delight to see others engaged in urging this approach to data science. Jupyter Notebooks are great for self-paced learning, intermixing a narrative with actual R code. The R code can be executed in place with results displayed in place as the student works through the material. Jupyter Notebooks are not such a great development environment though. Other environments excel there.

JupyterHub supports multiple users on the one platform, each with their own R/Jupyter process. The Linux Data Science Virtual Machine running on Azure provides these open source environments out of the box.  Access to JupyterHub is through port 8000.

Getting Started – Create a Ubuntu Data Science Virtual Machine

To begin we need to deploy a Ubuntu Data Science Virtual Machine. See the first two steps on my blog post. A DS14 server (or D14_V2 for a SSD based server) having 16 cores and 112 GB of RAM seems a good size (about $40 per day).

We may want to add a disk for user home folders as they can sometimes get quite large during training. To do so follow the Azure instructions:

  1. In the Portal click on the virtual machine.
  2. Click on Disks and Attach New.
  3. Choose the Size. 1000GB is probably okay for a class of 100.
  4. Click OK (takes about 2 minutes).
  5. Now log in to the server through ssh:
  6. The disk is visible as /dev/sdd
    • $ dmesg | grep SCSI
  7. Format the disk
    • $ sudo fdisk /dev/sdd
    • Type
      • n (new partition)
      • p (primary)
      • <enter> (1)
      • <enter> (2048)
      • <enter> (last sector)
      • p (print the partition table to check the new partition)
      • w (write the partition table and exit)
    • $ sudo mkfs -t ext4 /dev/sdd1
  8. Create a temporary mount point and mount
    • $ sudo mkdir /mnt/tmp
    • $ sudo mount /dev/sdd1 /mnt/tmp
    • $ mount | grep /sdd1
  9. We will use this disk to mount as /home by default, so set that up
    • Check how much disk is used for /home
      • $ sudo du -sh /home
    • Synchronise /home to the new disk
      • $ sudo rsync -avzh /home/ /mnt/tmp/
    • Identify the unique identifier for the disk
      • $ sudo -i blkid | grep sdd1
    • Tell the system to mount the new disk as /home
      • $ sudo emacs /etc/fstab
      • Add the following single line with the appropriate UUID
        UUID=f395b783-31da-4916-a3a9-8fb56fd7a068 /home ext4 defaults,nofail,discard 1 2
    • Now mount the new disk as /home
      • $ sudo mount /home
    • No longer need the temporary mount so unmount
      • $ sudo umount /mnt/tmp
    • Move to the new version of home and ensure ssh can access
      • $ cd ~
      • $ df -h .
      • $ sudo restorecon -r /home
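The /etc/fstab entry in step 9 can also be generated rather than typed by hand, which avoids copying the UUID incorrectly. The following is only a minimal sketch: make_fstab_entry is a hypothetical helper name, and the mount options are simply those shown in the list above.

```shell
# Hypothetical helper: build the /etc/fstab entry for a given UUID.
# The mount point defaults to /home, as in the steps above.
make_fstab_entry() {
  uuid="$1"
  mount_point="${2:-/home}"
  printf 'UUID=%s %s ext4 defaults,nofail,discard 1 2\n' "$uuid" "$mount_point"
}

# Example using the UUID shown above. On a real server the UUID could be
# read with:  uuid=$(sudo blkid -s UUID -o value /dev/sdd1)
make_fstab_entry f395b783-31da-4916-a3a9-8fb56fd7a068
```

The printed line can be reviewed and then appended to /etc/fstab, for example by piping it through sudo tee -a /etc/fstab.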

Connecting to JupyterHub

If you set up a DNS name label dsvmxyz01 and the location is southeastasia then visit:

The first time you connect to the site the browser will warn you that the connection is insecure. The server is using a self-signed certificate to encrypt the traffic between your browser and the server. That is fine, though a little disconcerting. As the user you can simply click through to allow the connection and add an exception. This often involves clicking on Advanced, then Add Exception… and then Confirm Security Exception. It is safe to provide an exception for now. However, it is best to install a proper certificate!
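The address to visit can be assembled from the DNS name label and the location. The sketch below assumes the usual Azure public DNS pattern of label.location.cloudapp.azure.com together with the JupyterHub port 8000 noted earlier. The function name dsvm_url is hypothetical, so do check the result against the DNS name shown in the Azure portal.

```shell
# Hypothetical helper: assemble the JupyterHub address for a DSVM.
# Assumes the Azure pattern <label>.<location>.cloudapp.azure.com.
dsvm_url() {
  label="$1"
  location="$2"
  port="${3:-8000}"   # JupyterHub listens on port 8000
  printf 'https://%s.%s.cloudapp.azure.com:%s\n' "$label" "$location" "$port"
}

dsvm_url dsvmxyz01 southeastasia
```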

Install a LetsEncrypt Certificate

We can instead install a free Let’s Encrypt certificate so that the server presents a valid certificate rather than a self-signed one. To do so we first need to allow connections through the https port (443) in the Azure portal for the DSVM. Then log on to the server and do the following:

$ ssh
$ sudo yum install epel-release
$ sudo yum install httpd mod_ssl python-certbot-apache
$ sudo emacs /etc/httpd/conf.d/ssl.conf
  Within the Virtual Host entry add
    # SSLProtocol all -SSLv2
$ sudo systemctl restart httpd
$ sudo systemctl status httpd
$ sudo certbot --apache -d
$ sudo systemctl start httpd

You should be able to connect now without the certificate warning.

You are presented with a Jupyter Hub Sign in page.


Creating User Accounts

Log in to the server. How you do this will depend on whether you set up an ssh key or a username and password. We assume the latter for this post. On a terminal (or using PuTTY on Windows), connect as:

$ ssh

You will be prompted for a password.

We can then create user accounts for each user in our workshop. The user accounts are created on the Linux DSVM. Here we create 40 user accounts and record their random usernames and passwords into the file usersinfo.csv on the server:

for i in {1..40}; do
  u=$(openssl rand -hex 2)
  sudo adduser user$u --gecos "" --disabled-password
  p=$(openssl rand -hex 5)
  echo "user$u:$p" | sudo chpasswd
  echo "user$u:$p" >> usersinfo.csv
done
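With only two random bytes per username there is a small chance that two of the 40 draws collide, in which case adduser would fail for the second one. As a minimal sketch, the credentials can be generated and checked for uniqueness before any account is created. The helper names rand_hex and gen_credentials are hypothetical, and od reading /dev/urandom is used here as a stand-in for openssl rand -hex.

```shell
# Hypothetical helper: n random bytes as hex (stand-in for `openssl rand -hex n`).
rand_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Print `count` lines of "userXXXX:password" with unique usernames.
# Nothing is created on the system; this only prints the credentials.
gen_credentials() {
  count="$1"
  seen=""
  made=0
  while [ "$made" -lt "$count" ]; do
    u="user$(rand_hex 2)"
    case " $seen " in
      *" $u "*) continue ;;   # username already drawn: try again
    esac
    seen="$seen $u"
    printf '%s:%s\n' "$u" "$(rand_hex 5)"
    made=$((made + 1))
  done
}

gen_credentials 3
```

The output has the same user:password layout as usersinfo.csv, so it could be fed line by line to adduser and chpasswd as in the loop above.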

If the process has issues and you need to start the account creation again then delete the users:

for i in $(cut -d ":" -f1 usersinfo.csv); do
  sudo deluser --remove-home $i
done

# Check it has been done

tail /etc/passwd
ls /home/
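Checking /etc/passwd and /home by eye works for a handful of accounts but is easy to get wrong for 40. A scripted check is sketched below, assuming the user:password layout of usersinfo.csv. The function name missing_users is hypothetical.

```shell
# Hypothetical helper: print the usernames listed in a "user:password"
# file that do not exist as accounts on this machine.
missing_users() {
  cut -d ":" -f1 "$1" | while read -r u; do
    id -u "$u" >/dev/null 2>&1 || echo "$u"
  done
}

# After account creation this should print nothing.
# After deletion it should print every username in the file.
if [ -f usersinfo.csv ]; then
  missing_users usersinfo.csv
fi
```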

Provide a username/passwd to each participant of the workshop, one line only to each user. The file will begin something like:


Now go back to the JupyterHub sign-in page and sign in with, for example, the Username userce81 and Password d0dfac5a30 (using a username and password from your own usersinfo.csv file).

Once logged in Jupyter will display a file browser.


Notice a number of notebooks are available. Click the IntroTutorialInR.ipynb notebook for a basic introduction to R.


Backup Option – RStudio

JupyterHub requires https and so won’t run internally within a customer site if they have a firewall blocking all SSL (encrypted) communications. In this case RStudio Server is a backup option. It is pre-installed on the server and if you followed my instructions above for deploying a DSVM you will have updated to the latest version too.

Connect to the RStudio server:

Sign in to RStudio with the same Username and Password as above.


Running Rattle through an X2Go Desktop

If you followed my DSVM deployment guide then you will have also set up X2Go on your local computer to support a desktop connection across to the DSVM. This is very convenient for running desktop applications, like Rattle, on the DSVM. Every student in the class gets the same environment.

Shortcuts to the Services

The URLs are rather long and so we can set up either or shortcuts. Visiting the latter we set up two short URLs: as as

We can now use the short URLs to refer to the long URLs.

REMEMBER: Deploy-Compute-Destroy for a cost effective hardware platform for Data Science. Deallocate (Stop) your server when it is not required.
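Deallocating can also be done from the command line rather than the portal. The following is a minimal sketch assuming the Azure CLI (az) is installed and logged in. The wrapper name stop_dsvm and the resource group name my-workshop-rg are hypothetical placeholders. By default the wrapper only prints the command so it can be checked before it is run.

```shell
# Hypothetical wrapper: deallocate (stop) a VM when it is not required.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
stop_dsvm() {
  resource_group="$1"
  vm_name="$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "az vm deallocate --resource-group $resource_group --name $vm_name"
  else
    az vm deallocate --resource-group "$resource_group" --name "$vm_name"
  fi
}

# my-workshop-rg is a placeholder resource group name.
stop_dsvm my-workshop-rg dsvmxyz01
```

Setting DRY_RUN=0 before calling stop_dsvm runs the command for real.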

Graham @ Microsoft

I had the privilege to join a panel in 2014 that explored big data opportunities and challenges. Together, coordinated by Professor Zhi-Hua Zhou, we captured our thoughts into a paper published in the IEEE Computational Intelligence Magazine (Volume 9, Number 4).

It is an honour to learn that we have received a 2017 IEEE Outstanding Paper Award. The paper is:

Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin, Graham J. Williams. “Big data opportunities and challenges: Discussions from data analytics perspectives”, IEEE Computational Intelligence Magazine, vol. 9, no. 4, November 2014, pp. 62-74.

The paper includes a discussion of taking ensemble concepts to the extreme, reflecting on the need for the pendulum to swing back toward protecting privacy, and the resulting focus on massively ensembled models, each “model” modelling an individual across extensive populations. The award was bestowed in November 2017.

A 5-video series called Data Science for Beginners has been released by Microsoft. It introduces practical data science concepts to a non-technical audience… making data science accessible – keeping the language clear and simple as an entry point to understanding data science.


I saw a demo of a package for Rapid and Pretty Things in R earlier in the year when it was a work in progress. It is now live on GitHub (but not yet CRAN). It allows you to very quickly visualise data in R using a Shiny GUI to generate ggplot2 underneath. A nice app for some visual analytics.



The Australian Government’s Data Analytics Centre of Excellence has released a new resource for employees of the Australian Government to freely interact and share experiences and knowledge. The resource is hosted on the AnalyticsSpace and uses the Askbot open source software which is modelled on StackOverflow. If you are an Australian Government employee working with data and analytics then you can join the community.

Visit Q&A on AnalyticsSpace.

Welcome to the New Togaware Presence.

Here you will find resources for the Data Scientist.

The site is under development. More material is available from