Professor Graham Williams is Chief Scientist of the Software Innovation Institute at the Australian National University. Prior to joining the ANU in 2020 he was Director of Data Science for Asia Pacific at Microsoft (2016-2020). Before that he was Lead Data Scientist with the Australian Government’s Data Analytics Centre of Excellence and Senior Data Scientist at the Australian Taxation Office. In the 1990s he was Principal Computer Scientist for Data Mining with CSIRO, the Australian Government’s premier research organisation.
Graham has led projects in data mining since the 1980s as a researcher, educator, and practitioner in areas of health, banking, insurance, finance, taxation, fraud identification, immigration, customs, and government. He has developed open source software and web services for data mining. His latest open source, open AI endeavour is the mlhub, which aims to make AI, Machine Learning and Data Science accessible to anyone with an interest. That interest is coming from many quarters, ranging from high school students learning about AI and university students studying AI to enterprises deploying AI into production.
Graham’s research contributions include Ensemble Decision Tree Induction (1987) as a precursor to Random Forests and other ensemble machine learning algorithms, HotSpots for identifying interestingness in very large data collections (1997), WebDM data mining services pioneering the use of XML (1995), and Rattle (2005), a popular and simple-to-use Graphical User Interface for data mining using R, deployed worldwide. See a list of over 70 books and articles on the Publications Page and more background on his Bio Page.
In May 2020 at the 24th Pacific Asia Conference on Knowledge Discovery and Data Mining, Graham received the prestigious PAKDD Achievement Award. The award, one of only two ever awarded by the international steering committee over 24 years, is presented in recognition of extraordinary and ongoing contributions in research and service to the advancement of the PAKDD community and series of conferences.
In 2011 at the Pacific Asia Conference on Knowledge Discovery and Data Mining, Graham received the PAKDD Distinguished Contribution Award. The award is presented to recognise significant and continued contributions in research and services to the advancement of the research community.
Graham’s popular 2011 book Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery is available from Springer as part of the Use R! series. The free and open source Rattle data mining software and the companion Data Mining Desktop Survival Guide continue to grow in content, user base, and readership. Rattle has also been released as Information Builders’ RStat predictive analytics for WebFOCUS.
Graham’s 2017 book The Essentials of Data Science: Knowledge Discovery Using R is available from CRC Press as part of The R Series. The book takes a template-based approach to building confidence in programming with data in the free and open source R Statistical Software. The companion Essentials web site contains ready-to-run material for the data scientist, whilst extensive further resources, including many template-oriented resources, can be found on the OnePageR web site.
Some of Graham’s presentations and interviews can be found on the presentations page.
Professor and Chief Scientist (2020-present)
Software Innovation Institute, Australian National University
Director of Data Science (2016-2020)
Microsoft Asia Pacific
Senior Director and Data Scientist (2004-2016)
Enterprise Analytics, Australian Taxation Office
Principal Research Scientist (1992-2004)
CSIRO Mathematical and Information Sciences
Researcher and Academic (1989-1992)
Computer Science, The Australian National University
Visiting Professor (2011-2014)
Shenzhen Institutes of Advanced Computing
Chinese Academy of Sciences
Adjunct Professor (2005-present)
University of Canberra
Australian National University
Chief Data Scientist (2002-present)
Togaware Pty Ltd
- Technical and Strategic Leadership of Data Science Teams
- Data Mining Research and Consultancies
- Programming Data Analyses using R Statistical Software
- Data Mining Software Development using R, Python, and Julia
- Machine Learning algorithms
- Training Courses and Books in Data Mining
- Debian and Ubuntu GNU/Linux and Open Source Software
- Knowledge-based systems implementation
- Software engineering, literate programming, R, Python, C, C++, Java
- XML, SOAP, Web Services
- Freedom in access to data science
- Ensemble models, social network analysis, cloud computing
- Machine Learning, Artificial Intelligence and Knowledge-Based Systems
- Open source software and freedom in computing
- Data Mining in Fraud, Non-Compliance, Security
- Data Mining in Health, Finance, Insurance
- PhD in Computer Science, Australian National University, Canberra, Australia, 1991. Title: Inducing and Combining Decision Structures for Expert Systems. Contribution: Introducing concept of ensembles of decision trees. Supervisor: Professor Robin Stanton.
- BSc (Maths) (First Class Honours) in Computer Science, University of Adelaide, South Australia, Australia, 1983.
- Chief Scientist, Software Innovation Institute, the Australian National University (2020-present).
- Director of Data Science, Asia Pacific, Microsoft, building and leading a top team of Data Scientists supporting organisations adopting and moving to the Microsoft data science platform, which now encompasses Open Source Software, R Statistical Software, and Azure Cloud (2016-2020).
- Senior Director and Data Scientist, Australian Taxation Office, deploying over 100 analytics models into production using open source software. (2004-2016)
- Consultant to Information Builders, New York, overseeing implementation of RStat module for WebFOCUS (2008–2011).
- Principal Research Scientist, Commonwealth Scientific and Industrial Research Organisation (CSIRO) Mathematical and Information Sciences, Australia. Applying research in machine learning and spatial analysis, and forming the first data mining research team in Australia by 1995 (1992–2004).
- Academic, Computer Science, the Australian National University. (1989-1992)
- Advanced Technologies Consultant, HiSoft Expert Systems, Melbourne, Australia, implementing an expert system for Esanda Finance that assessed loan applications and was used for over 10 years (1989–1990).
- Research and Development Manager, BBJ Computers International, Melbourne, Australia, managing a team implementing a decision tree induction algorithm based on his PhD research (1987–1988).
- Distinguished Contribution Award from the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD), for significant and continued contributions in research and services to the advancement of the PAKDD community, 2011.
- Commissioner’s Award for Innovation, Australian Taxation Office, for leadership of the e-tax group fraud project, 2010.
- Australia Day Medallion, for significant contribution to the broader community through the development of open source software for data mining, 2007.
- The High Performance Computing Challenge Most Innovative Award, Supercomputing, Orlando, Florida, for an open test bed for managing, mining and modelling massive and distributed data, built as an international collaboration of connected supercomputers performing distributed data mining calculations, 1998.
- Best paper, International Conference on Expert Systems, Avignon, France, describing an expert system for bush fire management in Kakadu National Park, Australia, 1986.
- RStat: A commercial module for Information Builders’ WebFOCUS business intelligence product, 2008.
- rattle: An R package released in 2005 bringing together over 100 other R packages that are useful to the data scientist, exposed through a graphical user interface. http://togaware.com/rattle/.
- pmml: An R package released in 2007 to export R predictive analytics models as PMML which can then be imported into other tools. https://cran.r-project.org/web/packages/pmml/.
- wskm: An R package developed with the Chinese Academy of Sciences and released in 2014 for entropy weighted k-means sub-space clustering over large datasets. https://cran.r-project.org/web/packages/wskm/
- wsrf: An R package developed with the Chinese Academy of Sciences and released in 2014 for the parallel implementation of weighted subspace random forest. https://cran.r-project.org/web/packages/wsrf/
- wajig: A system and package manager for Debian and Ubuntu based GNU/Linux systems developed (1995-2005). https://wiki.debian.org/Wajig
- TeX Catalogue: XML based cataloging software for CTAN and the TeX community (1986-2011). http://texcatalogue.ctan.org/
- Two single author books: The Essentials of Data Science: Knowledge Discovery Using R, and Data Mining with Rattle and R: The art of excavating data for knowledge discovery.
- 3 internet books: Data Mining, GNU/Linux, TeX Catalogue.
- 9 co-edited volumes, including New Frontiers in Applied Data Mining. Lecture Notes in Computer Science Volume 5669, September 2010, Springer-Verlag.
- 7 book chapters including: Rattle and Other Data Mining Tales. In Journeys to Data Mining: Experiences of 15 Renowned Scientists, Mohamed Medhat Gaber (Ed.), Springer, 2012.
- 18 journal papers including Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives, IEEE Computational Intelligence Magazine, 2014, and On-line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms, Data Mining and Knowledge Discovery, 8(3), 2004.
- 35 peer reviewed conference papers including Mining Risk Patterns in Medical Data, ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, 2005.
- American Association for Artificial Intelligence
- Association for Computing Machinery
- IEEE Computer Society
- Australian Computer Society
- Fellow, Institute of Analytics Professionals of Australia
Professional Network and Community Engagement
- Chair for over 15 International Conferences and Program Committees including ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Australasian Data Mining Conference (AusDM), Australian Artificial Intelligence Conference, International Conference on Simulated Evolution And Learning (SEAL).
- Steering Committee Chair, Treasurer, Member of Pacific-Asia International Conference on Knowledge Discovery and Data Mining, PAKDD http://pakdd.org/scmembers.html, Australasian Data Mining Conference http://ausdm.org/steering.html, and Australian Computer Society National Committee on Artificial Intelligence and Expert Systems https://www.acs.org.au/communities/artificial-intelligence-committee/committee-members.
- Program Committee Co-Chair (Industry and Government), ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015, 10-13 August 2015, Sydney. http://kdd.org/kdd2015/organizers.html
- Co-Chair ACM SIGKDD Australia and New Zealand Chapter (2014-present). http://datamining.it.uts.edu.au/anzkdd/
- Over 30 keynote, panel, and invited presentations to conferences, workshops, and tutorials, including: Workshop on Data Modelling, Mining & Analytics, 5-7 May 2015, International Centre for Free and Open Source Software (ICFOSS), Trivandrum, Kerala, India, http://icfoss.org/R/; Excavating Knowledge from Data, 7 March 2014, CUNY Data Mining Initiative, New York, http://datamining.ws.gc.cuny.edu/2014/03/24/graham-williams-workshop/; and Ensembles and Model Delivery for Tax Compliance, invited talk, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing.
- Member of the Australian GovHack judging panel for 2015. https://www.govhack.org/
- Founder and coordinator (2014–present) of Data Science Canberra Meetup. http://www.meetup.com/Data-Science-Canberra/
Consultancies and Projects
- Australian Customs Service: Advanced Risk Profiling 2003.
- Australian Taxation Office: Integrated Compliance 2000, Compliant Returns 1999.
- Commonwealth Bank of Australia: Home equity facility 2000.
- Commonwealth Department of Health and Ageing: Variation in Clinical Outcomes 2003, General Practitioner and Hospital Care Priority Area Costs 2003, Rare Adverse Drug Reactions 2003, GP Workforce Model 2003, Queensland Linked Data 2002, Hospitalizations for Ambulatory Care Sensitive Conditions 2002, Pre-admission and post-discharge activity 2002, GP workforce projection 2002, Trend analysis of the general practitioner workforce 2002, Future health costs for Australia 2001.
- Credit Reference Association of Australia: Manual review record matching 1997.
- Esanda Finance, Australia and New Zealand (ANZ) Bank: Expert System for Loan Approvals 1989.
- Health Insurance Commission: Pathology Explorer 2001, Medicare claim patterns 1998, Operator profiles 1997.
- Insurance Australia Group: Analysis of motor vehicle claims data 1997.
- Queensland Health: Patient Profiling 2003.
- The book Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, published by Springer in 2011 and used extensively worldwide for teaching data mining.
- Ensemble Learning: Multiple Decision Tree Induction PhD research 1986 to 1990.
- Demonstrated use of ID3 to build an ensemble of classifiers rather than just a single classifier.
- Identified that decision tree induction algorithms like ID3 can generate multiple decision trees that can be combined to produce classifiers that perform better than the individual classifiers.
- Integrated into commercial Database Management System in 1988 by BBJ Computers, into the TODAY 4GL.
- Commercial Knowledge-Based Expert Systems Development
- Victorian Government Departments 1987: VBARS combined business advice from multiple agencies to provide a single point of entry for the delivery of this advice to the public.
- Esanda Finance 1989: Credit Assessment System. One of the first commercial Australian expert systems and still in daily operation in 2000, providing credit assessments for distributed on-line application for credit from motor vehicle dealers.
- Spatial Reasoning
- One of the earliest spatial expert systems developed for fire management at Kakadu National Park 1985
Awarded best paper at the International Conference on Expert Systems, Avignon, France.
- Spatial reasoning systems based on expectations integrated within GIS 1991
- Methodologies for Exploring for Interesting Discoveries
- Hot Spots Data Mining Methodology 1994
Combine rule induction with clustering to identify interesting groups within very large datasets.
- Rule Evolver for Interesting Rule Discovery 1997
Employ evolutionary techniques to assist in the task of identifying interesting patterns in extremely large databases.
- Statistical Outlier Detection with NEC Japan
Identify under-represented patterns within extremely large databases.
- Fraud Detection for the Health Insurance Commission 1998
Successful identification of cases of fraudulent behaviour
- Other Data Mining Activities
- With Peter Milne and Joshua Huang, set up the first data mining research and applications group in Australia in 1995, within CSIRO, Canberra.
- Linking and Cleaning distributed administrative health databases 2002