High Performance Data Management Issues |
Workshop on Parallel and Distributed Data Mining, Melbourne, April 1998
Graham J. Williams
http://www.cmis.csiro.au/Graham.Williams
CSIRO, Mathematical and Information Sciences, Canberra
- ACSys Data Mining (recap)
- Hot Spots Data Mining
- High Performance Data Management Issues
- Gigabyte size databases from HIC, NRMA, ATO, MSSSO, Medibank
- Domain expertise "on tap"
- Real-world problems driving back-room research
- Parallel: Hot Spots; PRIM; GAMs; MARS; Thin Plate Splines
- Data Mining with the Arcade Data Explorer (Java, JFC/Swing, JDBC)
- Data Management remains a crucial issue
How we go about Hot Spots Data Mining in the Arcade Data Explorer:
- Segmentation: Distance measuring for a group of (closest) data
- Rule Induction: Counting members of a group
- Evolution: Fitness evaluation of group
- Visualisation: Cannot currently visualise very large groups sensibly
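The segmentation, rule induction and evolution steps listed above each come down to a pass over the data. The following is a minimal sketch only, assuming Euclidean distance, fixed group centres, and a hypothetical fitness measure (the mean of some value within the group). None of these choices, nor the function names, are taken from the Arcade Data Explorer.

```python
import math
from collections import Counter

def nearest_centre(record, centres):
    """Segmentation: index of the closest centre (Euclidean distance, assumed)."""
    return min(range(len(centres)), key=lambda i: math.dist(record, centres[i]))

def segment(records, centres):
    """Assign every record to its closest group."""
    return [nearest_centre(r, centres) for r in records]

def group_sizes(labels):
    """Rule induction: count the members of each group."""
    return Counter(labels)

def fitness(labels, values, group):
    """Evolution: hypothetical fitness, the mean of `values` inside one group."""
    members = [v for label, v in zip(labels, values) if label == group]
    return sum(members) / len(members) if members else 0.0

# Illustrative data only.
records = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.5), (8.5, 9.0), (1.1, 1.3)]
values = [100.0, 120.0, 900.0, 1100.0, 80.0]
centres = [(1.0, 1.0), (9.0, 9.0)]

labels = segment(records, centres)
sizes = group_sizes(labels)
```

Each function scans the whole data set, which is why the cost of these steps is dominated by data access once the data reaches gigabyte size.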
Characteristics of Requirements |
- Extremely large data warehouses
- Need quick summaries of the data (univariate over millions)
- Many Data Mining operations partition data and operate on groups
- Calculations are often column-based rather than row-based
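The requirements above (quick univariate summaries, group-wise operations, column-based calculations) can be illustrated with a column-oriented layout. This is a sketch under assumptions: the attribute names and values are invented, and an in-memory dictionary of lists stands in for whatever storage a real system would use.

```python
from collections import defaultdict

# Column-oriented layout: one list per attribute rather than one tuple per row.
columns = {
    "age":   [34, 51, 29, 62, 45, 38],
    "state": ["ACT", "NSW", "ACT", "VIC", "NSW", "ACT"],
    "cost":  [120.0, 340.0, 80.0, 510.0, 260.0, 100.0],
}

def summarise(column):
    """Univariate summary in a single pass: count, min, max, mean."""
    n, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in column:
        n += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
    return {"n": n, "min": lo, "max": hi, "mean": total / n if n else None}

def partition(key_column):
    """Map each key value to the row indices that belong to that group."""
    groups = defaultdict(list)
    for i, key in enumerate(key_column):
        groups[key].append(i)
    return dict(groups)

def group_means(columns, key, value):
    """Operate on each group, touching only the two columns involved."""
    col = columns[value]
    return {k: sum(col[i] for i in idx) / len(idx)
            for k, idx in partition(columns[key]).items()}

cost_summary = summarise(columns["cost"])
mean_cost_by_state = group_means(columns, "state", "cost")
```

Neither the summary nor the grouped mean reads the `age` column, which is the practical benefit of column-based access over row-based access.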
Options for Data Management |
- Move the Data to the Algorithm (C5.0, CART, ...)
Efficient memory management tricks
- Move the Algorithm to the Data (Application Oriented
Databases)
Generic Data Management facilities
- Merge the Algorithms and the Data (Persistent
Languages)
Not orthogonal but tuned for Data Mining
- Maintain the Divide (Data Warehouses, Data Cubes, OLAP,
CHESS?)
Rely on large body of DB researchers and commercial
imperative
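The difference between the first two options can be sketched with a simple counting task. The example uses Python's built-in `sqlite3` module as a stand-in for a real database; the table, columns and data are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (patient_id INTEGER, item TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, "A", 20.0), (1, "B", 35.0), (2, "A", 20.0), (3, "C", 90.0), (3, "A", 20.0)],
)

# Move the data to the algorithm:
# every row crosses the database boundary and is counted in the application.
counts_in_app = Counter(item for (item,) in conn.execute("SELECT item FROM claims"))

# Move the algorithm to the data:
# the database does the counting and only one row per group comes back.
counts_in_db = dict(conn.execute("SELECT item, COUNT(*) FROM claims GROUP BY item"))

conn.close()
```

Both approaches give the same counts. What differs is how much data leaves the database, which matters more as the tables grow.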
Applications Oriented DB: Phasme |
- Application Oriented Database System
- Dynamically customise the structure of the database to suit application
- Employs plug-ins to effect this
- Move the data model from the database to the application
- Integration of ArcadeDX with Phasme is being explored
- Relational Database Systems
- Geared towards transactional processing (read, write, integrity)
- Query optimisation focused on transactions
- Re-optimise database to suit OLAP and now Data Mining queries
- Data Warehouse and Fast Data Cube generation/access
- Receiving considerable commercial attention
- Focus on read-only data is right for Data Mining
- Data Mining to piggyback on advances in Parallel and Distributed Data Warehouses?
- Goal is seamless but common data management for analysis tools
- Database technology is mature and addresses HPC
- Persistence in Programming Languages and Applications Oriented
Databases may provide specialised solutions
- Explore how to tune RDBMS for efficient Data Mining through Data
Cubes
Solution remains part of the database problem
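For readers unfamiliar with data cubes, the idea of precomputing aggregates for every combination of dimensions can be sketched as follows. This is a naive illustration, assuming a sum aggregate and invented dimension names; it rescans the rows for each group-by, whereas a real cube engine would share work between them.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2**d group-bys)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube

# Illustrative data only.
rows = [
    {"state": "ACT", "year": 1997, "cost": 10.0},
    {"state": "ACT", "year": 1998, "cost": 15.0},
    {"state": "NSW", "year": 1997, "cost": 30.0},
    {"state": "NSW", "year": 1998, "cost": 5.0},
]
cube = data_cube(rows, ("state", "year"), "cost")
```

Once the cube is built, a query such as "total cost by state" is a lookup in `cube[("state",)]` rather than a scan, which is the kind of read-only access pattern that data mining operations could exploit.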
Copyright © 1998 Graham J. Williams
(Graham.Williams@cmis.csiro.au)
Last modified: 19 Apr 1998
(LaTeX doc modified: 14 Apr 1998)