Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

Authors:
Ruoming Jin;Ge Yang;Gagan Agrawal
Affiliations:
-;-;IEEE Computer Society
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
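As a rough illustration of two of the techniques named in the abstract, the sketch below contrasts full replication (each thread updates a private copy of the reduction object, and the copies are merged at the end) with full locking (one shared copy, one lock per element). It is a minimal Python sketch under those assumed definitions, not the paper's runtime system or programming interface; the function names `count_full_replication` and `count_full_locking` and the item-counting workload are hypothetical.

```python
import threading
from collections import Counter


def _run_threads(target, arg_list):
    """Start one thread per argument tuple and wait for all of them."""
    threads = [threading.Thread(target=target, args=args) for args in arg_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_full_replication(chunks):
    """Full replication (assumed meaning): every thread owns a private copy."""
    local_copies = [Counter() for _ in chunks]

    def work(i):
        for item in chunks[i]:
            local_copies[i][item] += 1  # private copy, so no lock is needed

    _run_threads(work, [(i,) for i in range(len(chunks))])

    # Merge step: combine the per-thread copies into the final result.
    merged = Counter()
    for copy in local_copies:
        merged.update(copy)
    return merged


def count_full_locking(chunks, keys):
    """Full locking (assumed meaning): one shared copy, one lock per element."""
    shared = {k: 0 for k in keys}
    locks = {k: threading.Lock() for k in keys}

    def work(chunk):
        for item in chunk:
            with locks[item]:  # protects only this element
                shared[item] += 1

    _run_threads(work, [(chunk,) for chunk in chunks])
    return Counter(shared)


if __name__ == "__main__":
    data = [["a", "b", "a"], ["b", "c"], ["a"]]
    print(count_full_replication(data))
    print(count_full_locking(data, keys=["a", "b", "c"]))
```

The trade-off the sketch hints at is memory and merge cost for replication versus lock overhead and contention for locking, which is consistent with the abstract's observation that no single technique wins in all settings.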