Automatic Subspace Clustering of High Dimensional Data

Authors:
Rakesh Agrawal;Johannes Gehrke;Dimitrios Gunopulos;Prabhakar Raghavan
Affiliations:
IBM Almaden Research Center, San Jose 95120;Computer Science Department, Cornell University, Ithaca;Department of Computer Science and Eng., University of California Riverside, Riverside 92521;Verity, Inc., Germany
Venue:
Data Mining and Knowledge Discovery
Year:
2005

Abstract

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
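The bottom-up search for dense regions in subspaces, which the abstract describes only at a high level, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an equal-width grid with `xi` intervals per dimension and a density threshold `tau` (the fraction of all points a grid cell must exceed to count as dense); the function names `cell_of` and `dense_units` and the toy data are hypothetical. As a simplification, candidates are pruned per subspace rather than per unit, and two later steps are omitted: joining adjacent dense units into clusters, and producing minimized DNF descriptions.

```python
from collections import Counter
from itertools import combinations


def cell_of(point, dims, xi, lo, hi):
    """Map a point to its grid cell within the subspace `dims` (tuple of dimension indices)."""
    cell = []
    for d in dims:
        width = (hi[d] - lo[d]) / xi or 1.0  # guard against a zero-width dimension
        cell.append(min(int((point[d] - lo[d]) / width), xi - 1))
    return tuple(cell)


def dense_units(points, xi=10, tau=0.2, max_dim=None):
    """Return {subspace: set of dense cells}, searching subspaces bottom-up."""
    n, ndim = len(points), len(points[0])
    lo = [min(p[d] for p in points) for d in range(ndim)]
    hi = [max(p[d] for p in points) for d in range(ndim)]
    max_dim = max_dim or ndim

    def count_dense(subspaces):
        # One pass over the data per subspace; keep cells above the threshold.
        found = {}
        for dims in subspaces:
            counts = Counter(cell_of(p, dims, xi, lo, hi) for p in points)
            dense = {c for c, k in counts.items() if k / n > tau}
            if dense:
                found[dims] = dense
        return found

    result = {}
    level = count_dense([(d,) for d in range(ndim)])
    k = 1
    while level and k <= max_dim:
        result.update(level)
        # A (k+1)-dimensional subspace can hold a dense cell only if every
        # k-dimensional projection of it already holds one (monotonicity).
        candidates = set()
        for a, b in combinations(sorted(level), 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == k + 1 and all(
                sub in level for sub in combinations(merged, k)
            ):
                candidates.add(merged)
        level = count_dense(sorted(candidates))
        k += 1
    return result


# Toy data (an assumption for illustration): 60 points concentrated in
# dimensions 0 and 1 but spread evenly along dimension 2, plus 40 scattered points.
cluster = [(0.1, 0.1, i / 60) for i in range(60)]
noise = [(i / 39, (i * 7 % 40) / 39, (i * 11 % 40) / 39) for i in range(40)]
points = cluster + noise

found = dense_units(points, xi=4, tau=0.3)
for dims in sorted(found, key=len):
    print(dims, sorted(found[dims]))
```

Because the result depends only on per-cell counts, shuffling `points` leaves `found` unchanged, which mirrors the order-insensitivity the abstract claims. Dimension 2 never appears in the output because no interval along it is dense on its own, so no subspace containing it is ever counted.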