Probabilistic model-based clustering of complex data

Authors:
Shi Zhong;Joydeep Ghosh
Affiliations:
-;-
Venue:
Probabilistic model-based clustering of complex data
Year:
2003

Citing 0
Cited 9

Short communication: Variable space hidden Markov model for topic detection and analysis
Knowledge-Based Systems

A framework for WWW user activity analysis based on user interest
Knowledge-Based Systems

A two-stage mechanism for registration and classification of ECG using Gaussian mixture model
Pattern Recognition

A new distance measure for hidden Markov models
Expert Systems with Applications: An International Journal

Multi-grain hierarchical topic extraction algorithm for text mining
Expert Systems with Applications: An International Journal

Semantic multi-grain mixture topic model for text analysis
Expert Systems with Applications: An International Journal

ESPClust: an effective skew prevention method for model-based document clustering
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing

Topics modeling based on selective Zipf distribution
Expert Systems with Applications: An International Journal

MMPClust: a skew prevention algorithm for model-based document clustering
DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications

Abstract

In many emerging data mining applications, one needs to cluster complex data such as very high-dimensional sparse text documents and continuous or discrete time sequences. Probabilistic model-based clustering techniques have shown promising results in many such applications. For real-valued low-dimensional vector data, Gaussian models have been frequently used. For very high-dimensional vector and non-vector data, model-based clustering is a natural choice when it is difficult to extract good features or identify an appropriate measure of similarity between pairs of data objects. This dissertation presents a unified framework for model-based clustering based on a bipartite graph view of data and models. The framework includes an information-theoretic analysis of model-based partitional clustering from a deterministic annealing point of view and a view of model-based hierarchical clustering that leads to several useful extensions. The framework is used to develop two new variations of model-based clustering: a balanced model-based partitional clustering algorithm that produces clusters of comparable sizes, and a hybrid model-based clustering approach that combines the advantages of partitional and hierarchical model-based algorithms. I apply the framework and new clustering algorithms to cluster several distinct types of complex data, ranging from arbitrary-shaped 2-D synthetic data to high-dimensional documents, EEG time series, and gene expression time sequences. The empirical results demonstrate the usefulness of the scalable, balanced model-based clustering algorithms, as well as the benefits of the hybrid model-based clustering approach. They also showcase the generality of the proposed clustering framework.
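
To make the deterministic annealing view of model-based partitional clustering more concrete, the following is a minimal illustrative sketch, not the dissertation's algorithm. It assumes 1-D data, unit-variance Gaussian component models, and a common formulation of deterministic annealing in which the posterior over models is proportional to the likelihood raised to the power 1/T. The function names, the cooling schedule, and the toy data are all hypothetical choices made for illustration.

```python
import math


def log_gauss(x, mean, var=1.0):
    """Log-density of a 1-D Gaussian; stands in for any model's log-likelihood."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def soft_assign(x, means, temperature):
    """Posterior over models for one point: proportional to p(x | model)^(1/T)."""
    scores = [log_gauss(x, m) / temperature for m in means]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def annealed_clustering(data, means, t_start=3.0, t_end=0.01,
                        cooling=0.5, inner_iters=20):
    """Alternate soft assignment and weighted mean re-estimation while cooling T."""
    means = list(means)
    t = t_start
    while t >= t_end:
        for _ in range(inner_iters):
            post = [soft_assign(x, means, t) for x in data]
            for k in range(len(means)):
                mass = sum(p[k] for p in post)
                if mass > 0:  # leave a model untouched if it owns no data
                    means[k] = sum(p[k] * x for p, x in zip(post, data)) / mass
        t *= cooling  # assumed geometric cooling schedule
    # Final hard assignment: each point goes to its most likely model.
    labels = [max(range(len(means)), key=lambda k: log_gauss(x, means[k]))
              for x in data]
    return means, labels


# Hypothetical toy data: two well-separated groups on the real line.
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
means, labels = annealed_clustering(data, means=[-0.5, 0.5])
print(means, labels)
```

At high temperature every point is shared almost evenly among the models, and as the temperature falls the assignments harden toward a partition. Swapping `log_gauss` for another model's log-likelihood (for example, a multinomial for documents or a hidden Markov model for sequences) is the kind of generality the model-based view is meant to allow, although that substitution is not shown here.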