High Performance Data Management Issues in Data Mining

High Performance Data Management Issues
Workshop on Parallel and Distributed Data Mining, Melbourne, April 1998

Graham J. Williams
http://www.cmis.csiro.au/Graham.Williams
CSIRO, Mathematical and Information Sciences, Canberra

ACSys Data Mining

Hot Spots Data Mining

How we go about Hot Spots Data Mining in the Arcade Data Explorer:

Data Access Requirements

Characteristics of Requirements

Options for Data Management

Move the Data to the Algorithm (C5.0, CART, …)
- Efficient memory management tricks
Move the Algorithm to the Data (Application Oriented Databases)
- Generic Data Management facilities
Merge the Algorithms and the Data (Persistent Languages)
- Not orthogonal but tuned for Data Mining
Maintain the Divide (Data Warehouses, Data Cubes, OLAP, CHESS?)
- Rely on large body of DB researchers and commercial imperative
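
The difference between the first two options can be sketched with a toy example. This is an illustrative sketch only, not the Arcade or Phasme implementation: the `claims` table, its columns, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data store. Both functions compute the same class counts (the kind of sufficient statistic a tree learner needs), but one ships every row to the application while the other ships only the summary.

```python
import sqlite3
from collections import Counter


def make_demo_db():
    """Build a tiny in-memory table; schema and rows are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (age_group TEXT, outcome TEXT)")
    rows = [
        ("young", "ok"), ("young", "flag"), ("young", "ok"),
        ("old", "flag"), ("old", "flag"), ("old", "ok"),
    ]
    conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)
    return conn


def counts_data_to_algorithm(conn):
    """Move the data to the algorithm: fetch every row, count in application memory."""
    counts = Counter()
    for age_group, outcome in conn.execute("SELECT age_group, outcome FROM claims"):
        counts[(age_group, outcome)] += 1
    return dict(counts)


def counts_algorithm_to_data(conn):
    """Move the algorithm to the data: the database counts, only the summary comes back."""
    query = (
        "SELECT age_group, outcome, COUNT(*) "
        "FROM claims GROUP BY age_group, outcome"
    )
    return {(a, o): n for a, o, n in conn.execute(query)}


conn = make_demo_db()
# Both routes yield identical counts; they differ in how much data crosses the divide.
assert counts_data_to_algorithm(conn) == counts_algorithm_to_data(conn)
```

With six rows the two routes are indistinguishable in cost; the trade-off only matters once the table is too large to pull across comfortably.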

Applications Oriented DB: Phasme

Maintain the Divide

Relational Database Systems
- Geared towards transactional processing (read, write, integrity)
- Query optimisation focused on transactions
- Database must be re-optimised to suit OLAP and, now, Data Mining queries
Data Warehouse and Fast Data Cube generation/access
- Receiving considerable commercial attention
- Focus on read-only data is right for Data Mining
- Can Data Mining piggyback on advances in
  Parallel and Distributed Data Warehouses?
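
To make "Data Cube generation" concrete, here is a minimal sketch that counts records for every subset of a set of dimensions, which is what a cube precomputes over read-only data. It is a naive illustration under stated assumptions, not a fast or parallel algorithm: the function name `build_cube`, the `ALL` placeholder, and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # placeholder marking a dimension that has been aggregated away


def build_cube(records, dimensions):
    """Count records for every subset of the dimensions (2^d group-bys)."""
    cube = defaultdict(int)
    for r in range(len(dimensions) + 1):
        for kept in combinations(dimensions, r):
            for record in records:
                # Keep the value for retained dimensions, collapse the rest to ALL.
                cell = tuple(record[d] if d in kept else ALL for d in dimensions)
                cube[cell] += 1
    return dict(cube)


# Invented sample data with two dimensions.
records = [
    {"region": "north", "product": "a"},
    {"region": "north", "product": "b"},
    {"region": "south", "product": "a"},
]
cube = build_cube(records, ("region", "product"))

# Once built, any roll-up is a dictionary lookup rather than a scan.
print(cube[("north", ALL)])  # count of records in the north, across all products
```

Because the source data is read-only, the cube never needs to be invalidated, which is one reason the warehouse setting suits Data Mining queries.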

Summary

Goal is seamless but common data management for analysis tools
Database technology is mature and addresses HPC
Persistence in Programming Languages and Applications Oriented Databases may provide specialised solutions
Explore how to tune RDBMS for efficient Data Mining through Data Cubes
Solution remains part of the database problem
