A semidiscrete matrix decomposition for latent semantic indexing information retrieval
ACM Transactions on Information Systems (TOIS)
Probabilistic latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Clustering in large graphs and matrices
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
ROCK: a robust clustering algorithm for categorical attributes
Information Systems
Algorithm 805: computation and uses of the semidiscrete matrix decomposition
ACM Transactions on Mathematical Software (TOMS)
Clustering categorical data: an approach based on dynamical systems
The VLDB Journal — The International Journal on Very Large Data Bases
Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
PROXIMUS: a framework for analyzing very high dimensional discrete-attributed datasets
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Biclustering Gene-Feature Matrices for Statistically Significant Dense Patterns
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets
IEEE Transactions on Knowledge and Data Engineering
Hi-index | 0.00 |
With the availability of large scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique based on a variant of semi-discrete matrix decomposition (SDD), which is capable of compressing large discrete-valued datasets in an error bounded fashion. We show that this process of compression can be thought of as identifying dominant patterns in underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.