Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

Authors:
Tansel Özyer;Reda Alhajj
Affiliations:
Department of Computer Engineering, TOBB ETU Economics and Technology University, Ankara, Turkey 06560;Department of Computer Science, University of Calgary, Calgary, Canada
Venue:
Applied Intelligence
Year:
2009

Abstract

This paper applies a divide-and-conquer approach iteratively to the clustering process. The goal is a parallelized approach that is both effective and efficient and that produces the intended clustering result. We achieve scalability by first partitioning a large dataset into subsets of manageable size, based on the specifications of the machine to be used for clustering, and then clustering the partitions separately in parallel. The centroid of each resulting cluster is treated as the root of a tree whose leaves are the instances in that cluster. The partitioning and clustering process is then applied iteratively to the centroids, with the trees growing upward, until we obtain the final clustering; the outcome is a forest with one tree per cluster. Finally, a conquer step produces the actual intended clustering, in which each instance (leaf node) belongs to the final cluster represented by the root of its tree. We use a multi-objective genetic algorithm combined with validity indices to decide on the number of classes. This approach is well suited to interactive online clustering. It facilitates incremental clustering because chunks of instances are clustered as stand-alone sets, and the results are then merged with existing clusters. This is attractive and feasible because only centroids are clustered after the first clustering stage. The reported test results demonstrate the applicability and effectiveness of the proposed approach.
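
The partition, cluster, and conquer loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. As a loudly stated assumption, a plain k-means routine stands in for the multi-objective genetic algorithm, and the number of clusters is fixed by the caller instead of being chosen with validity indices. The names kmeans, divide_and_conquer, chunk_size, k_local, and k_final are hypothetical. A thread pool is used only to show where the per-partition work could run in parallel; real CPU parallelism in Python would more likely use processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: a stand-in (assumption) for the paper's multi-objective GA."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = rng.sample(points, k)

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assign()


def divide_and_conquer(points, chunk_size, k_local, k_final):
    """Cluster chunks separately, then re-cluster their centroids until one chunk remains."""
    assert 0 < k_local < chunk_size, "each pass must shrink the current level"
    level = list(points)              # nodes at the current tree level (leaves at first)
    owner = list(range(len(points)))  # owner[i]: current-level node above instance i

    while len(level) > chunk_size:
        chunks = [level[s:s + chunk_size] for s in range(0, len(level), chunk_size)]
        # Divide: cluster every chunk independently (hence parallelizable).
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda ch: kmeans(ch, k_local), chunks))

        next_level, remap = [], {}
        for chunk_no, (cents, labels) in enumerate(results):
            used = sorted(set(labels))  # drop clusters that ended up empty
            new_idx = {c: len(next_level) + j for j, c in enumerate(used)}
            next_level.extend(cents[c] for c in used)
            for offset, lab in enumerate(labels):
                remap[chunk_no * chunk_size + offset] = new_idx[lab]

        # Grow the trees: each instance now hangs under its chunk's centroid.
        owner = [remap[o] for o in owner]
        level = next_level

    # Final clustering of the remaining centroids gives the roots of the forest.
    roots, root_labels = kmeans(level, k_final)
    # Conquer: every instance inherits the cluster of the root of its tree.
    return roots, [root_labels[o] for o in owner]
```

Calling divide_and_conquer(points, chunk_size=20, k_local=4, k_final=2) on a list of coordinate tuples returns the final centroids and one cluster label per input instance. Only the first pass touches the raw instances; every later pass clusters centroids alone, which is the property the abstract relies on for incremental clustering.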