Information-theoretic clustering aims to exploit information-theoretic measures as clustering criteria. A common practice is the so-called INFO-K-means, which performs K-means clustering with the KL-divergence as the proximity function. While existing efforts on INFO-K-means have shown promising results, a remaining challenge is handling high-dimensional sparse data, where the centroids may contain many zero-valued features. These zeros lead to infinite KL-divergence values, which create a dilemma in assigning objects to centroids during the K-means iterations. To address this dilemma, in this paper we propose a Summation-based Incremental Learning (SAIL) method for INFO-K-means clustering. Specifically, by exploiting an equivalent objective function, SAIL replaces the computation of the KL-divergence with the computation of the Shannon entropy, thereby avoiding the zero-value dilemma caused by the KL-divergence. Our experimental results on various real-world document data sets show that, with SAIL as a booster, the clustering performance of K-means can be significantly improved. SAIL also achieves quick convergence and robust clustering performance on high-dimensional sparse data.
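To make the zero-value dilemma and the entropy-based reformulation concrete, below is a minimal Python sketch. It is not the authors' implementation, and the function names and toy data are illustrative assumptions; it only demonstrates that KL(x || m) becomes infinite as soon as a centroid m has a zero entry where an object x does not, and that the per-cluster sum of KL-divergences equals n_j * H(m_j) up to a constant that does not depend on the partition, where H is the Shannon entropy and zeros are harmless because 0 log 0 = 0 by convention.

    # A minimal sketch (not the paper's implementation; names are illustrative)
    # of the zero-value dilemma and the entropy-based equivalent objective.
    import math

    def kl_divergence(x, m):
        """KL(x || m) for two discrete distributions given as equal-length lists."""
        total = 0.0
        for xi, mi in zip(x, m):
            if xi == 0.0:
                continue            # 0 * log(0 / m_i) = 0 by convention
            if mi == 0.0:
                return math.inf     # the zero-value dilemma on sparse centroids
            total += xi * math.log(xi / mi)
        return total

    def shannon_entropy(p):
        """H(p) = -sum_i p_i log p_i, with 0 log 0 = 0, so zeros are harmless."""
        return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

    # Two sparse "documents" as term distributions over a 4-word vocabulary.
    x1 = [0.5, 0.5, 0.0, 0.0]
    x2 = [0.0, 0.0, 0.5, 0.5]
    centroid = [(a + b) / 2 for a, b in zip(x1, x2)]    # cluster mean

    # Against a centroid with zeros where x2 is nonzero, KL blows up,
    # while the Shannon entropy of the same centroid stays finite.
    sparse_centroid = [0.8, 0.2, 0.0, 0.0]
    print(kl_divergence(x2, sparse_centroid))    # inf: assignment is undecidable
    print(shannon_entropy(sparse_centroid))      # ~0.5004: well defined

    # The equivalence: sum_{x in C_j} KL(x || m_j) = n_j * H(m_j) - sum_x H(x),
    # and the H(x) terms are constants of the data, independent of the partition.
    # Hence minimizing sum_j n_j * H(m_j) yields the same clustering.
    n_j = 2
    lhs = kl_divergence(x1, centroid) + kl_divergence(x2, centroid)
    rhs = n_j * shannon_entropy(centroid) - (shannon_entropy(x1) + shannon_entropy(x2))
    print(abs(lhs - rhs) < 1e-12)                # True: the identity holds

Since each centroid m_j is just the running summation of its members divided by n_j, the change in n_j * H(m_j) when an object moves between clusters can be evaluated by updating that summation incrementally, which appears to be what the "summation-based incremental" name refers to.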