{"id":165,"date":"2015-10-01T06:26:51","date_gmt":"2015-09-30T20:26:51","guid":{"rendered":"http:\/\/togaware.com\/?page_id=165"},"modified":"2021-03-29T12:06:38","modified_gmt":"2021-03-29T01:06:38","slug":"data-mining","status":"publish","type":"page","link":"https:\/\/togaware.com\/data-mining\/","title":{"rendered":"Data Science"},"content":{"rendered":"\n
\u201cSo what\u2019s getting ubiquitous and cheap? Data<\/b>. And what is complementary to data? Analysis<\/b>. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.\u201d<\/i> Professor Hal Varian, Chief Economist at Google, speaking to the New York Times in February 2009.<\/p>\n\n\n\n
The\u00a0Data Science Desktop Survival\u00a0Guide (R Edition)<\/a><\/strong>\u00a0provides a one-page-per-concept guide to navigating your way around the world of data science using Free (Libre) and Open Source Software. The book is continually being updated and the recipes presented verified.\u00a0A PDF version is available<\/a> for a donation to help maintain these free resources.<\/p>\n\n\n\n Visit the Essentials of Data Science<\/strong><\/a> site for extensive material for hands-on entry into doing Data Science. The material supports the book: The Essentials of Data Science<\/a>.<\/p>\n\n\n\n Togaware Resources<\/b><\/p>\n\n\n\n Other Resources<\/b><\/p>\n\n\n\n Using R for Data Science The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.<\/p>\n\n\n\n R is memory based so that on 32-bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64-bit multi-CPU (AMD64) servers running GNU\/Linux with 32 GB of main memory provides a powerful platform for data mining.<\/p>\n\n\n\n R is open source, thus providing assurance that there will always be the opportunity to fix and tune things to suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.<\/p>\n\n\n\n Also, by being open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM\u2019s Intelligent Miner).<\/p>\n\n\n\n Open standards are important for users, but vendors resist them for obvious reasons, and would prefer to lock you in to their products. A number of commercial tools claim to support, for example, the open standard PMML for interoperability (sharing models between applications). 
But the support is patchy and not worth the effort. We have started a PMML effort in R to attempt to address the desire for interoperability.<\/p>\n\n\n\n Specific commercial statistical products are excellent at handling very large datasets, but they are limited in the analytic algorithms they provide. Commercial vendors, naturally, need to be convinced of the usefulness of implementing new algorithms. On the other hand, a vast selection has been available for deployment in R for a long time.<\/p>\n\n\n\n
<\/b><\/p>\n\n\n\n