The national scalable cluster project: three lessons about high performance data mining and data intensive computing

  • Authors:
  • Robert Grossman; Robert Hollebeek

  • Affiliations:
  • University of Illinois at Chicago and Magnify Inc., Chicago, IL; University of Pennsylvania and Hubs Inc., Philadelphia, PA

  • Venue:
  • Handbook of massive data sets
  • Year:
  • 2002


Abstract

We discuss three lessons learned from experience with the National Scalable Cluster Project. First, storing, managing, and mining massive data requires systems that exploit parallelism; this can be achieved with shared-nothing clusters and careful attention to I/O paths. Second, exploiting data parallelism at the file and record level maps data-intensive problems efficiently onto clusters and is particularly well suited to data mining. Third, the repetitive nature of data mining demands special attention to data layout on the hardware and to software access patterns, while maintaining a storage schema easily derived from the legacy form of the data.
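The second lesson can be illustrated with a minimal sketch of the partition/mine/merge pattern: records are split into disjoint shards (on a shared-nothing cluster, each shard would reside on a different node's local disk), each shard is mined independently, and the partial results are merged into a global answer. All names here (`partition_records`, `mine_shard`, `merge_counts`) and the toy frequency-count "mining" step are illustrative assumptions, not part of the NSCP software.

```python
# Hypothetical sketch of file/record-level data parallelism for mining.
from collections import Counter
from typing import Iterable, List

def partition_records(records: List[str], n_shards: int) -> List[List[str]]:
    """Split records into n_shards disjoint shards (round-robin).
    On a shared-nothing cluster each shard sits on its own node's
    local disk, so the I/O paths operate in parallel."""
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, rec in enumerate(records):
        shards[i % n_shards].append(rec)
    return shards

def mine_shard(shard: Iterable[str]) -> Counter:
    """Local 'mining' step: here, a simple frequency count.
    Each node runs this independently on its own shard."""
    return Counter(shard)

def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine per-shard partial results into a global model."""
    total: Counter = Counter()
    for p in partials:
        total.update(p)
    return total

records = ["a", "b", "a", "c", "b", "a"]
shards = partition_records(records, n_shards=3)
result = merge_counts(mine_shard(s) for s in shards)
print(result["a"])  # global frequency of "a" across all shards
```

Because the mining step touches only its own shard and the merge touches only the small partial results, this decomposition scales with the number of nodes, which is the property the abstract attributes to file- and record-level parallelism.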