Bregman bubble clustering: A robust framework for mining dense clusters

Authors:
Gunjan Gupta;Joydeep Ghosh
Affiliations:
University of Texas at Austin, Austin, TX;University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2008

Abstract

In classical clustering, each data point is assigned to at least one cluster. However, in many applications only a small subset of the available data is relevant to the problem, and the rest needs to be ignored in order to obtain good clusters. Certain nonparametric density-based clustering methods find the most relevant data as multiple dense regions, but such methods are generally limited to low-dimensional data and do not scale well to large, high-dimensional datasets. They also use a specific notion of “distance”, typically Euclidean or Mahalanobis distance, which further limits their applicability. On the other hand, the recent One Class Information Bottleneck (OC-IB) method is fast and works on a large class of distortion measures known as Bregman Divergences, but can only find a single dense region. This article presents a broad framework for finding k dense clusters while ignoring the rest of the data. It includes a seeding algorithm that can automatically determine a suitable value for k. When k is forced to 1, our method gives rise to an improved version of OC-IB with optimality guarantees. We provide a generative model that yields the proposed iterative algorithm for finding k dense regions as a special case. Our analysis reveals an interesting and novel connection between the problem of finding dense regions and exponential mixture models; a hard model corresponding to k exponential mixtures with a uniform background results in a set of k dense clusters. The proposed method is a highly scalable algorithm for finding multiple dense regions that works with any Bregman Divergence, thus extending density-based clustering to a variety of non-Euclidean problems not addressable by earlier methods. We present empirical results on three artificial datasets, two microarray datasets, and one text dataset to show the relevance and effectiveness of our methods.
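The idea of finding k dense clusters while ignoring the rest of the data can be pictured with a small sketch. The code below is an illustration under assumptions, not the authors' implementation: the function names (`dense_clusters`, `squared_euclidean`, `generalized_kl`), the parameter `s` (how many points to keep), and the caller-supplied starting centers are all hypothetical, and the seeding algorithm mentioned in the abstract is not reproduced. It relies on one well-known property of Bregman divergences: the arithmetic mean of a set of points minimizes their total divergence to a single center, so the center update is the same for every divergence.

```python
import math


def squared_euclidean(x, c):
    """Bregman divergence generated by phi(v) = ||v||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def generalized_kl(x, c):
    """Bregman divergence generated by phi(v) = sum v*log(v); needs positive entries."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, c))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def dense_clusters(points, centers, s, divergence=squared_euclidean, max_iter=100):
    """Hypothetical hard sketch: keep the s points closest to their nearest center.

    Returns (centers, assignment), where assignment maps the index of each
    kept point to the index of its center; all other points are background.
    """
    centers = [tuple(c) for c in centers]
    assignment = {}
    for _ in range(max_iter):
        # For every point, find its nearest center and the divergence to it.
        nearest = []
        for i, x in enumerate(points):
            j = min(range(len(centers)), key=lambda j: divergence(x, centers[j]))
            nearest.append((divergence(x, centers[j]), i, j))
        # Keep the s cheapest points; the rest are ignored in this round.
        nearest.sort()
        new_assignment = {i: j for _, i, j in nearest[:s]}
        if new_assignment == assignment:
            break  # assignments are stable, so the centers will not move
        assignment = new_assignment
        # Re-estimate each center as the mean of the points it currently holds.
        for j in range(len(centers)):
            members = [points[i] for i, cj in assignment.items() if cj == j]
            if members:
                centers[j] = mean(members)
    return centers, assignment
```

Passing `divergence=generalized_kl` switches the same loop to a KL-type distortion on strictly positive data, which is the sense in which a single procedure can serve several non-Euclidean settings. The choice of `s` and of the initial centers is left to the caller here.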