Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.
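The clustering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes dense Python lists rather than sparse matrices, initialization from k randomly chosen documents, and a fixed iteration cap. The names `normalize`, `dot`, and `spherical_kmeans` are hypothetical. Each document is scaled to unit Euclidean norm and assigned to the concept vector with the largest cosine similarity, and each concept vector is recomputed as the cluster centroid normalized to unit norm.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, k, iters=20, seed=0):
    """Return (labels, concepts) for the given document vectors.

    concepts[j] is the centroid of cluster j, normalized to unit norm.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: start from k distinct documents picked at random.
    concepts = [list(docs[i]) for i in rng.sample(range(len(docs)), k)]
    labels = None
    for _ in range(iters):
        # Assignment step: pick the most similar concept vector per document.
        new_labels = [
            max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs
        ]
        if new_labels == labels:
            break  # the partition did not change, so the concepts are stable
        labels = new_labels
        # Update step: normalized centroid of each non-empty cluster.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                centroid = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(centroid)
    return labels, concepts


# Toy term-count vectors: documents 0-1 share words, as do documents 2-3.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 1]]
labels, concepts = spherical_kmeans(docs, k=2)
print(labels)
```

On this toy input the two word-disjoint groups end up in separate clusters; which group receives label 0 depends on the random initialization. A concept decomposition as described in the abstract would then project the document matrix onto the span of `concepts` by least squares, which this sketch does not include.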