Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering

Authors:
Gabriela Moise; Jörg Sander
Affiliations:
University of Alberta, Edmonton, AB, Canada; University of Alberta, Edmonton, AB, Canada
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

Projected and subspace clustering algorithms search for clusters of points in subsets of attributes. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes. Subspace clustering enumerates clusters of points in all subsets of attributes, typically producing many overlapping clusters. One problem of existing approaches is that their objectives are stated in a way that is not independent of the particular algorithm proposed to detect such clusters. A second problem is the definition of cluster density based on user-defined parameters, which makes it hard to assess whether the reported clusters are an artifact of the algorithm or whether they actually stand out in the data in a statistical sense. We propose a novel problem formulation that aims at extracting axis-parallel regions that stand out in the data in a statistical sense. The set of axis-parallel, statistically significant regions that exist in a given data set is typically highly redundant. Therefore, we formulate the problem of representing this set through a reduced, non-redundant set of axis-parallel, statistically significant regions as an optimization problem. Exhaustive search is not a viable solution due to computational infeasibility, and we propose the approximation algorithm STATPC. Our comprehensive experimental evaluation shows that STATPC significantly outperforms existing projected and subspace clustering algorithms in terms of accuracy.
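To make the notion of a region that "stands out in the data in a statistical sense" more concrete, the following is a minimal sketch of one way such a check could be done. It is not the STATPC algorithm. It assumes, purely for illustration, that the data are scaled to the unit hypercube, that the null hypothesis is a uniform distribution, and that significance is judged with an exact binomial upper-tail probability. The function names `binomial_upper_tail` and `region_p_value` and the synthetic data are hypothetical and do not come from the paper.

```python
import math
import random


def binomial_upper_tail(n, k, p):
    """Return P[X >= k] for X ~ Binomial(n, p), summed exactly."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def region_p_value(points, bounds):
    """Score an axis-parallel region against a uniform null hypothesis.

    points: list of tuples with coordinates in [0, 1] (assumed scaling).
    bounds: dict mapping an attribute index to a (low, high) interval;
            attributes not listed are unconstrained.
    Returns (support, p_value), where support is the number of points
    inside the region and p_value is the binomial upper-tail probability.
    """
    n = len(points)
    # Under the uniform null, the chance that a point falls in the region
    # equals the region's volume in the constrained attributes.
    volume = 1.0
    for low, high in bounds.values():
        volume *= high - low
    support = sum(
        all(low <= pt[d] <= high for d, (low, high) in bounds.items())
        for pt in points
    )
    return support, binomial_upper_tail(n, support, volume)


# Synthetic data (illustrative only): 150 uniform background points in
# 5 attributes, plus 50 points concentrated in attributes 0 and 2.
rng = random.Random(0)
data = [tuple(rng.random() for _ in range(5)) for _ in range(150)]
for _ in range(50):
    pt = [rng.random() for _ in range(5)]
    pt[0] = rng.uniform(0.4, 0.5)
    pt[2] = rng.uniform(0.4, 0.5)
    data.append(tuple(pt))

dense_region = {0: (0.4, 0.5), 2: (0.4, 0.5)}
support, p_value = region_p_value(data, dense_region)
print(support, p_value)
```

A region would be reported as significant when its p-value falls below a chosen threshold. Many nested or overlapping regions around the same dense area would typically pass such a test, which is the redundancy that the abstract's optimization formulation is meant to reduce.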