Middleware for data mining applications on clusters and grids

Authors:
Leonid Glimcher; Ruoming Jin; Gagan Agrawal
Affiliations:
Department of Computer Science and Engineering, Ohio State University, 2015 Neil Avenue, Columbus, OH 43210, USA; Department of Computer Science, Kent State University, Kent, OH 44242, USA; Department of Computer Science and Engineering, Ohio State University, 2015 Neil Avenue, Columbus, OH 43210, USA
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

This paper gives an overview of two middleware systems that have been developed over the last six years to address the challenges of developing parallel and distributed implementations of data mining algorithms. FREERIDE (FRamework for Rapid Implementation of Data mining Engines) focuses on data mining in a cluster environment. FREERIDE is based on the observation that parallel versions of several well-known data mining techniques share a similar structure and can be parallelized by dividing the data instances (or records or transactions) among the nodes. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction. The reduction involves only commutative and associative operations, which means the result is independent of the order in which the data instances are processed. After the local reduction on each node, a global reduction is performed. The middleware exploits this shared structure to execute data mining tasks efficiently in parallel, starting from a relatively high-level specification of the technique. To enable processing of data sets stored in remote data repositories, we have extended the FREERIDE middleware into FREERIDE-G (FRamework for Rapid Implementation of Data mining Engines in Grid). FREERIDE-G supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. The added functionality in FREERIDE-G aims to hide the details of remote data retrieval, movement, and caching from application developers.
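
The local-then-global reduction structure described in the abstract can be illustrated with a minimal Python sketch. This is not the FREERIDE interface: the function names (split_among_nodes, local_reduction, global_reduction), the round-robin partitioning, and the item-counting task are all assumptions chosen for illustration, and the "nodes" are simulated sequentially in a single process.

```python
from collections import Counter
from functools import reduce
import random

# Hypothetical helper names; none of these come from FREERIDE itself.

def split_among_nodes(records, num_nodes):
    """Divide the data instances among the nodes (round-robin, as an assumption)."""
    return [records[i::num_nodes] for i in range(num_nodes)]

def local_reduction(chunk):
    """Process each instance on one node and fold it into a local reduction object."""
    counts = Counter()
    for transaction in chunk:
        # Integer addition is commutative and associative, so the order
        # in which transactions are read does not affect the counts.
        counts.update(set(transaction))
    return counts

def global_reduction(local_results):
    """Combine the per-node reduction objects into the final result."""
    return reduce(lambda a, b: a + b, local_results, Counter())

transactions = [
    ["bread", "milk"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
    ["milk"],
]

# Two simulated nodes, original record order.
chunks = split_among_nodes(transactions, num_nodes=2)
result = global_reduction([local_reduction(c) for c in chunks])

# Three simulated nodes, shuffled record order: the result is unchanged.
shuffled = transactions[:]
random.Random(0).shuffle(shuffled)
shuffled_result = global_reduction(
    [local_reduction(c) for c in split_among_nodes(shuffled, num_nodes=3)]
)

print(result == shuffled_result)  # True
```

Because the reduction object is updated only through addition, changing the number of nodes or the order of the records leaves the item counts unchanged. This is the property that lets a middleware layer choose the data partitioning and processing order on the application's behalf.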