Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Authors:
Mehmet Koyutürk;Ananth Grama;Naren Ramakrishnan
Venue:
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2002

Abstract

With the availability of large-scale computing platforms and instrumentation for data gathering, increasing emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique, based on a variant of semi-discrete matrix decomposition (SDD), that compresses large discrete-valued datasets in an error-bounded fashion. We show that this compression process can be viewed as identifying dominant patterns in the underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) its hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.
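
To make the idea of a "dominant pattern" concrete, the following is a minimal sketch of one common way to approximate a binary matrix A by a rank-one product x y^T with binary x and y, using alternating updates. It is an illustration under stated assumptions, not necessarily the algorithm derived in the paper: the data are assumed to be 0/1 valued, and the function names (rank_one_binary, hamming_error), the densest-row initialization, and the iteration cap are all hypothetical choices made for this example. With y fixed, the squared error is minimized by setting x_i = 1 exactly when row i matches more than half of the ones in y, and symmetrically for y with x fixed.

```python
def rank_one_binary(A, max_iter=20):
    """Alternating heuristic for a binary rank-one approximation A ~ x y^T.

    A is a list of equal-length rows containing 0/1 values.
    Returns (x, y): x marks the rows that share the pattern, y is the pattern.
    """
    m, n = len(A), len(A[0])
    # Assumption for this sketch: start from the densest row as the pattern.
    y = list(max(A, key=sum))
    x = [0] * m
    for _ in range(max_iter):
        ones_in_y = sum(y)
        # Fix y: row i joins the pattern if it covers more than half of y's ones.
        s = [sum(a * b for a, b in zip(row, y)) for row in A]
        new_x = [1 if 2 * si > ones_in_y else 0 for si in s]
        ones_in_x = sum(new_x)
        # Fix x: column j stays in the pattern if it is set in more than
        # half of the selected rows.
        t = [sum(A[i][j] for i in range(m) if new_x[i]) for j in range(n)]
        new_y = [1 if 2 * tj > ones_in_x else 0 for tj in t]
        if new_x == x and new_y == y:
            break  # converged: neither vector changed
        x, y = new_x, new_y
    return x, y


def hamming_error(A, x, y):
    """Number of entries where A differs from the outer product x y^T."""
    return sum(a != xi * yj for row, xi in zip(A, x) for a, yj in zip(row, y))


if __name__ == "__main__":
    A = [
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1],
    ]
    x, y = rank_one_binary(A)
    print(x, y, hamming_error(A, x, y))
```

In this toy matrix the first three rows are grouped under the pattern y, and the error counts the entries the single pattern fails to reproduce. One way to obtain a hierarchy, again as an assumption of this sketch, is to split the rows by x and apply the same routine recursively to each group until the error within a group falls below a chosen bound.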