Concept Decompositions for Large Sparse Text Data Using Clustering

Authors:
Inderjit S. Dhillon; Dharmendra S. Modha
Affiliations:
Department of Computer Science, University of Texas, Austin, TX 78712, USA. inderjit@cs.utexas.edu; IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA. dmodha@almaden.ibm.com
Venue:
Machine Learning
Year:
2001

Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.
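The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`normalize`, `dot`, `spherical_kmeans`), the choice to seed the concept vectors with the first k documents, the stopping rule (stop when assignments no longer change), and the dense list representation are all assumptions made for illustration. Real text data would use sparse vectors. The sketch covers only the clustering and the concept vectors, not the least-squares concept decomposition.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Inner product; for unit vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, k, max_iter=100):
    """Cluster document vectors by cosine similarity.

    Returns (labels, concepts), where concepts[j] is the centroid of
    cluster j normalized to unit Euclidean norm.
    """
    x = [normalize(d) for d in docs]
    # Assumption: seed the concept vectors with the first k documents.
    concepts = [list(v) for v in x[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar concept vector.
        new_labels = [max(range(k), key=lambda j: dot(v, concepts[j])) for v in x]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each concept vector as the normalized sum of its members.
        for j in range(k):
            members = [v for v, lab in zip(x, labels) if lab == j]
            if members:  # keep the previous vector if a cluster empties
                concepts[j] = normalize([sum(col) for col in zip(*members)])
    return labels, concepts


# Toy term-count vectors: documents 0 and 2 share words, as do 1 and 3.
docs = [
    [2, 1, 0, 0],
    [0, 0, 1, 3],
    [3, 1, 0, 0],
    [0, 0, 2, 2],
]
labels, concepts = spherical_kmeans(docs, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy example each concept vector is nonzero only on the words used by its own cluster, which is a small-scale picture of the sparsity and localization the abstract attributes to concept vectors.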