
Data scientists rely on the freedom to innovate that is afforded by open source software. We often deploy an open source software stack based on Ubuntu GNU/Linux and the R Statistical Software. This provides a powerful environment for the management, wrangling, analysis, modeling, and presentation of data within a tool that supports machine learning and artificial intelligence, including deep neural networks in R.

Whilst the open source software stack is also usually free of licensing fees, we do still need to buy hardware on which to carry out our data science activities. Our own desktop and laptop computers will often suffice, but as more data becomes available and our algorithms become more complex, having access to a Data Science Super Computer could be handy. The cloud offers cheap access to compute when you need it and the Azure Ubuntu Data Science Virtual Machine (DSVM) has become a great platform for my data science when I need it. The Ubuntu DSVM comes pre-installed with an extensive suite of the open source software that I need as a data scientist (including Rattle and RStudio).

A new data science virtual machine can be deployed with a few clicks and some minimal information in less than 5 minutes. As our data and compute needs grow it can be resized to suit. Paying for just the compute as required (e.g., at 25 cents per hour) is an attractive proposition, and powering down the server when not in use saves me considerably compared to having a departmental server running full time, irrespective of its workload. When it is not required we can deallocate the server so that it costs us nothing. There is no need for expensive high-specification hardware sitting on-premise waiting for the high demand loads. Simply allocate and resize the virtual machine as and when needed and pay for the hardware you need when you need it, not just in case you need it.
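
To make the arithmetic concrete, here is a rough sketch in shell. The function names are my own, and the 40 hours of use and the 30-day month are illustrative assumptions rather than figures from any price list.

```shell
# Cost in dollars of running a VM for a number of hours at an hourly rate.
usage_cost() {
    # $1 = hourly rate in dollars, $2 = hours the VM is running
    awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Approximate hourly rate implied by a monthly price (assumes a 30-day month).
hourly_rate() {
    # $1 = monthly price in dollars
    awk -v monthly="$1" 'BEGIN { printf "%.2f\n", monthly / (30 * 24) }'
}

# Example calls (25 cents per hour, as in the text above):
#   usage_cost 0.25 40     # VM running for 40 hours in the month
#   usage_cost 0.25 720    # VM left running for all 30 days
#   hourly_rate 100        # hourly equivalent of a $100 monthly price
```

Under these assumptions the part-time usage comes to $10.00 against $180.00 for the always-on server.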

The version of R provided with the Linux Data Science Virtual Machine is Microsoft’s R Server (closed source). This is based on the open source version of R but adds support for beyond-RAM datasets of any size, with parallel implementations of many of the machine learning algorithms for the data scientist. Note, though, that in the instructions below we replace Microsoft R Server as the default R with open source R. Both are then concurrently available on the server.

I begin with a link to obtaining a free trial subscription (if you don’t already have an Azure subscription) and then continue to set up the Ubuntu DSVM using the Azure Portal and configuring the new server with various extra Linux packages (that are not yet on the DSVM by default – but stay tuned) as well as an updated version of open source R and Rattle. Note that the deployment and setup of the DSVM can also be completed from R running on our own laptops or desktops using our new AzureDSVM R package. This then allows the process to be programmed.
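
As an aside, the same idea of programming the deployment can be sketched with the Azure CLI instead of the AzureDSVM R package. This is not the method used in this post, only an illustration. The helper name deploy_dsvm is hypothetical, and the Marketplace image and VM size are left as arguments because I am not assuming particular values for them. Marketplace images may also need their terms accepted before a first deployment.

```shell
# Sketch only: create a resource group and a VM with the Azure CLI.
# deploy_dsvm is a hypothetical helper; run `az login` beforehand.
deploy_dsvm() {
    # $1 = resource group, $2 = location, $3 = VM name (also used as DNS label),
    # $4 = admin username, $5 = Marketplace image, $6 = VM size
    az group create --name "$1" --location "$2" &&
    az vm create \
        --resource-group "$1" \
        --name "$3" \
        --image "$5" \
        --size "$6" \
        --admin-username "$4" \
        --generate-ssh-keys \
        --public-ip-address-dns-name "$3"
}

# Usage, with the image and size looked up separately:
#   deploy_dsvm dsvm_xyz_sea_res southeastasia dsvmxyz01 xyz "$IMAGE" "$SIZE"
```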

The following looks like a lot of steps, and maybe so, but each is simple and the whole process is really straightforward. If you disagree, please let me know and we’ll work on it.

Obtain an Azure subscription

  1. A free trial subscription is available from azure.com. This is useful to get a feel for the capabilities of the Azure cloud and the costs involved. Costs apply only for the time the DSVM is deployed (irrespective of how much the CPU is utilised when it is deployed) so it is good practice to stop the server if you don’t need it for a period of time.
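
Stopping and restarting can also be done from a terminal. Here is a minimal sketch, assuming the Azure CLI is installed and you have run `az login`. The helper names are my own, and the resource group and server names are the example names used in this walkthrough. On the command line it is `az vm deallocate` that releases the compute.

```shell
# Hypothetical helpers wrapping the Azure CLI.
stop_dsvm() {
    # $1 = resource group, $2 = VM name
    # Deallocate (rather than just power off) so the compute is released.
    az vm deallocate --resource-group "$1" --name "$2"
}

start_dsvm() {
    # $1 = resource group, $2 = VM name
    az vm start --resource-group "$1" --name "$2"
}

# Usage:
#   stop_dsvm dsvm_xyz_sea_res dsvmxyz01
#   start_dsvm dsvm_xyz_sea_res dsvmxyz01
```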

Create a Linux Data Science Virtual Machine

  1. Log on to the Azure Portal.
  2. Click on + New.
  3. Search the Marketplace for Linux Data Science Virtual Machine.
  4. Select the Data Science Virtual Machine for Linux (Ubuntu) from the search results.
  5. Read the description to see if it matches your requirements and then click on Create.
  6. Set up the Basics
    • Name the machine. E.g., dsvmxyz01.
    • Keep SSD as the VM disk type.
    • Provide a Username and Password. E.g., xyz and h%nHs72Gs#jK. (Using an SSH public key is preferred but beyond the scope of this introduction and can be set up later.)
    • Choose your Subscription.
    • Create a new Resource group and give it a name. E.g., dsvm_xyz_sea_res. A resource group is a logical collection of resources.
    • Choose a Location. E.g., Southeast Asia.
    • Click on OK. Your selections will be validated.
  7. Choose a server Size.
    • Choose a VM size. The configuration and monthly cost will be displayed for each. I generally start with the cheapest and rescale later as needed. Note that $100 for a month is, very roughly, 15c per hour whilst the server is Running, with no charge whilst it is Stopped. You can later resize the server if you need a bigger one to get things done more quickly.
    • Click View all to see all server options.
    • Once you have decided then click on Select.
  8. In Settings
    • Check the default information and generally we go with the defaults unless we know otherwise.
    • Click on OK.
  9. In Purchase
    • Check the Offer Details, the Summary and the Terms of use.
    • If all is okay then click on Purchase.
  10. Wait while Deploying Linux Data Science Virtual Machine
    • This takes about 5 minutes.
    • The new VM appears in the default Dashboard.
  11. Set up a DNS name label (should be done during set up – how?)
    • Click the Public IP address
    • Click Configuration
    • Provide a DNS name label. E.g., dsvmxyz01.
    • Click on the Save icon at the top of the tile.
    • We can now refer to the server as dsvmxyz01.southeastasia.cloudapp.azure.com
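
The host name in the last step follows a simple pattern of label, then location, then cloudapp.azure.com. Here is a small sketch of that pattern. The function name is hypothetical, and turning the displayed location into its short form by lowercasing and removing spaces is my assumption, which matches the Southeast Asia example here.

```shell
# Build the public host name from a DNS name label and an Azure location.
dsvm_fqdn() {
    # $1 = DNS name label, $2 = location as displayed (e.g. "Southeast Asia")
    region=$(printf '%s' "$2" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
    printf '%s.%s.cloudapp.azure.com\n' "$1" "$region"
}

# Usage:
#   dsvm_fqdn dsvmxyz01 "Southeast Asia"
```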

Connect using X2Go on your local desktop (Linux or Windows)

  1. X2Go provides access to the remote desktop for the DSVM. See http://wiki.x2go.org for details.
    • For Windows local computer: download and run the X2Go client for Windows.
    • For Ubuntu local computer using wajig:
      $ wajig install x2goclient
    • If it is not available then install it directly from X2Go
      $ wajig addrepo ppa:x2go/stable
      $ wajig update
      $ wajig install x2goclient
  2. Start up X2Go and create a new Session (top left icon)
    • Session Name: dsvmxyz01
    • Host: dsvmxyz01.southeastasia.cloudapp.azure.com
    • Login: xyz
    • Session Type: XFCE
    • Click on OK
  3. Click on the new session as it appears in the right hand column.
  4. Provide the password: h%nHs72Gs#jK
  5. Click on OK
  6. Click Yes on the Host key verification failed popup to accept the new server’s host key since this is the first time we have seen this new server.
  7. A desktop running on the remote virtual server will appear within a window on your local computer’s desktop.
  8. You can open up a Terminal Emulator from the bottom dock so as to continue on to tuning the server.
  9. An alternative is to simply connect to the new server using ssh (on GNU/Linux) or putty (on MS/Windows)
    $ ssh xyz@dsvmxyz01.southeastasia.cloudapp.azure.com
    Warning: Permanently added the ED25519 host key ...
    xyz@dsvmxyz01...'s password:
    Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4...)
    
    * Documentation: https://help.ubuntu.com
    * Management: https://landscape.canonical.com
    * Support: https://ubuntu.com/advantage
    
    Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud
    
    178 packages can be updated.
    1 update is a security update.
    ...
    $
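
If you connect by ssh regularly, a host alias saves typing the full name each time. Here is a sketch that prints an entry for ~/.ssh/config. The function name, the alias dsvm and the key path are hypothetical examples, and the key-based login in the usage comments is the SSH public key setup that this post otherwise leaves out of scope.

```shell
# Print an ~/.ssh/config entry so that `ssh <alias>` reaches the server.
dsvm_ssh_config() {
    # $1 = alias, $2 = server host name, $3 = login, $4 = private key path
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User %s\n' "$3"
    printf '    IdentityFile %s\n' "$4"
}

# Usage (key path is an example; create and install the key first):
#   ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_dsvm
#   ssh-copy-id -i ~/.ssh/id_ed25519_dsvm.pub xyz@dsvmxyz01.southeastasia.cloudapp.azure.com
#   dsvm_ssh_config dsvm dsvmxyz01.southeastasia.cloudapp.azure.com xyz \
#       ~/.ssh/id_ed25519_dsvm >> ~/.ssh/config
#   ssh dsvm
```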

Run RStudio Server on the DSVM

  1. The RStudio Server (and Desktop) is pre-installed on the Ubuntu DSVM but can be updated. Open up a Terminal Emulator (or ssh/putty connection) on the DSVM:
    $ wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
    $ wajig install rstudio-server-1.0.153-amd64.deb
    $ wget https://download1.rstudio.org/rstudio-xenial-1.0.153-amd64.deb
    $ wajig install rstudio-xenial-1.0.153-amd64.deb

    You may need to start the server: $ sudo rstudio-server start

  2. You will be asked for the user’s password in order to authorise the running of the RStudio server.
  3. Connect to http://dsvmxyz01.southeastasia.cloudapp.azure.com:8787 in your browser. You will be warned that the connection is not secure. You should see the RStudio login page and, if you are comfortable with the security warning, continue on to provide your username and password. Note that RSA encryption is used in transmitting the credentials so I believe it should be secure.
    Username: xyz
    Password: h%nHs72Gs#jK
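
If the security warning is a concern, one common alternative (not part of the steps above) is to forward the RStudio Server port over ssh and browse to http://localhost:8787 on your local machine instead. This is a sketch with a hypothetical helper name, assuming the server listens on port 8787 as above.

```shell
# Forward a local port to RStudio Server on the DSVM through ssh.
rstudio_tunnel() {
    # $1 = login, $2 = server host name, $3 = port (defaults to 8787)
    port="${3:-8787}"
    # -N: run no remote command, only forward the port.
    # -L: local port -> the same port on the server's own localhost.
    ssh -N -L "${port}:localhost:${port}" "$1@$2"
}

# Usage (leave it running, then open http://localhost:8787 locally):
#   rstudio_tunnel xyz dsvmxyz01.southeastasia.cloudapp.azure.com
```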
    

Install support packages and latest R

  1. Connect to the DSVM through X2Go for a desktop experience and open up a Terminal Emulator.
  2. Update the operating system, install some utilities, and then reboot the server (note that you do not have to accept the EULA for the msodbcsql package as it is not required for the open source stack and can be removed):
    $ sudo apt-get install wajig
    $ wajig remove msodbcsql
    $ wajig update
    $ wajig distupgrade
    $ wajig install htop libcanberra-gtk-module
    $ sudo locale-gen "en_AU.UTF-8"
    $ sudo reboot
    
  3. Re-connect to the DSVM through X2Go to install and test the latest R
    $ wajig addrepo ppa:marutter/rrutter
    $ wajig addrepo ppa:marutter/c2d4u
    $ wajig update
    $ wajig distupgrade
    $ wajig install r-recommended r-cran-rattle r-cran-tidyverse
    $ wajig install r-cran-xml r-cran-cairodevice r-cran-rpart.plot
    $ sudo Rscript -e 'install.packages("rattle", repos="http://rattle.togaware.com")'
    
  4. Test out the R installation:
    $ R
    > library(rattle)
    > rattle()
    Click: Execute; Yes; Model tab; Execute; Draw; Close; Yes
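
For a quick check without opening the Rattle GUI, the installed packages can also be tested from the shell. This is a sketch with a hypothetical helper name. It only asks R whether each package can be loaded and does not exercise it.

```shell
# Report whether each named R package can be loaded by the default R.
check_r_packages() {
    for pkg in "$@"; do
        # Rscript exits with status 0 if the package namespace loads, 1 otherwise.
        if Rscript -e "quit(save = 'no', status = if (requireNamespace('$pkg', quietly = TRUE)) 0 else 1)" >/dev/null 2>&1; then
            echo "ok $pkg"
        else
            echo "MISSING $pkg"
        fi
    done
}

# Usage:
#   check_r_packages rattle tidyverse rpart.plot
```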
    

Desktop RStudio

RStudio can be used through the browser from your local machine as we saw above, or else on the remote server’s desktop. For the latter:

  • Start up RStudio (click on icon)
  • Notice message warning about Untrusted application launcher
  • Click Mark Executable
  • RStudio will start up.

Graham @ Microsoft