NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases

Authors:
Ioannis A. Sarafis;Phil W. Trinder;Ali M. S. Zalzala
Affiliations:
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom;School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom;Technology and Research Solutions FZ-LIC, P.O. Box 500735, Dubai, United Arab Emirates
Venue:
Applied Soft Computing
Year:
2007

Abstract

Clustering is a descriptive data mining task that aims to partition data into homogeneous groups. This paper presents NOCEA, a novel evolutionary algorithm that efficiently and effectively clusters massive numerical databases. NOCEA evolves variable-length individuals consisting of disjoint, axis-aligned hyper-rectangular rules with homogeneous data distribution. The antecedent of each rule includes an interval-like condition for each dimension. A novel quantisation algorithm imposes a regular multi-dimensional grid structure onto the data space to reduce the number of search combinations. Due to quantisation, the boundaries of the intervals are encoded as integer values. The evolutionary search is guided by a simple data coverage maximisation function. The enormous data space is explored effectively by task-specific recombination and mutation operators that produce candidate solutions with non-overlapping rules. A parsimony generalisation operator shortens the discovered knowledge by replacing adjacent rules with more generic ones. NOCEA also employs a special homogeneity operator that enforces quasi-uniform data distribution in the space enclosed by the candidate rules. After convergence, the discovered knowledge is simplified to perform subspace clustering and to assemble the clusters. Results on real-world datasets show that NOCEA has several attractive properties for clustering, including: (a) comprehensible output in the form of disjoint and homogeneous rules, (b) the ability to discover clusters of arbitrary shape, density, size, and data coverage, (c) the ability to perform effective subspace clustering, (d) near-linear scalability with database size, data dimensionality, and cluster dimensionality, and (e) substantial potential for task parallelism (a speedup of 13.8 on 16 processors). As a real-world example, the paper includes a detailed study of the seismicity along the African-Eurasian-Arabian plate boundaries.
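
The rule representation described in the abstract can be illustrated with a minimal sketch. It is not NOCEA's implementation: the names `Rule`, `quantise`, `is_disjoint`, and `coverage` are hypothetical, and uniform bin widths are assumed for the grid (the abstract does not say how the quantisation algorithm chooses them). The sketch only shows how integer-encoded, axis-aligned hyper-rectangles can be checked for overlap and scored by the fraction of data they cover.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Inclusive [low, high] bounds expressed in grid-cell indices (assumption).
Interval = Tuple[int, int]


@dataclass(frozen=True)
class Rule:
    """Axis-aligned hyper-rectangle: one integer interval per dimension."""
    intervals: Tuple[Interval, ...]

    def covers(self, cell: Sequence[int]) -> bool:
        # A cell is covered when every coordinate lies inside its interval.
        return all(lo <= c <= hi for (lo, hi), c in zip(self.intervals, cell))

    def overlaps(self, other: "Rule") -> bool:
        # Two boxes intersect only if their intervals intersect in every dimension.
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for (a_lo, a_hi), (b_lo, b_hi) in zip(self.intervals, other.intervals)
        )


def quantise(point: Sequence[float], lower: Sequence[float],
             width: Sequence[float]) -> Tuple[int, ...]:
    """Map a real-valued point to grid-cell indices (uniform bins assumed)."""
    return tuple(int((x - lo) // w) for x, lo, w in zip(point, lower, width))


def is_disjoint(rules: List[Rule]) -> bool:
    """True when no pair of rules in the individual overlaps."""
    return not any(a.overlaps(b) for i, a in enumerate(rules) for b in rules[i + 1:])


def coverage(rules: List[Rule], cells: List[Tuple[int, ...]]) -> float:
    """Fraction of data points covered by at least one rule."""
    if not cells:
        return 0.0
    return sum(any(r.covers(c) for r in rules) for c in cells) / len(cells)


# Toy 2-D example: four points on a grid with unit-width bins.
points = [(0.5, 0.5), (1.5, 0.2), (7.9, 8.1), (4.0, 9.5)]
cells = [quantise(p, lower=(0.0, 0.0), width=(1.0, 1.0)) for p in points]

# A candidate individual made of two rules.
individual = [Rule(((0, 1), (0, 0))), Rule(((7, 8), (8, 9)))]

print(is_disjoint(individual))      # True
print(coverage(individual, cells))  # 0.75
```

In this toy example the two rules cover three of the four points, so the coverage score is 0.75. The homogeneity, generalisation, recombination, and mutation operators mentioned in the abstract are deliberately left out of the sketch.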