Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets

Authors:
Mehmet Koyuturk;Ananth Grama;Naren Ramakrishnan
Affiliations:
-;-;IEEE Computer Society
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which arise frequently in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing them. The proposed framework, PROXIMUS, provides a technique for reducing a large data set to a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show that PROXIMUS has desirable properties in terms of runtime, scalability to large data sets, the ability to represent data in a compact form, and the discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that using the compressed data for association rule mining yields excellent precision and recall (above 90 percent) across a range of problem parameters while drastically reducing the time required for analysis. We also show that the patterns discovered by PROXIMUS are highly interpretable in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool both for preprocessing data before applying computationally expensive algorithms and for directly extracting correlated patterns.
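
The abstract does not spell out the algorithm. As a rough illustration of what reducing a binary data set to a representative pattern with a measurable error can look like, the following is a minimal sketch that assumes a rank-one binary approximation computed by an alternating heuristic. The function names, the densest-row initialization, and the iteration cap are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def rank_one_binary(a: Matrix, max_iters: int = 20) -> Tuple[List[int], List[int]]:
    """Approximate a 0/1 matrix `a` by the outer product of two 0/1 vectors.

    x marks the rows that contain the pattern; y is the pattern over columns.
    This is an illustrative alternating heuristic, not a guaranteed optimum.
    """
    n_rows, n_cols = len(a), len(a[0])
    # Assumed initialization: start from the densest row as the pattern.
    y = list(max(a, key=sum))
    x = [0] * n_rows
    for _ in range(max_iters):
        # Fix y: a row joins if it overlaps at least half of the pattern.
        y_size = sum(y)
        new_x = [
            1 if 2 * sum(a[i][j] * y[j] for j in range(n_cols)) >= y_size else 0
            for i in range(n_rows)
        ]
        # Fix x: a column joins if at least half of the selected rows have it.
        x_size = sum(new_x)
        new_y = [
            1 if 2 * sum(a[i][j] * new_x[i] for i in range(n_rows)) >= x_size else 0
            for j in range(n_cols)
        ]
        if new_x == x and new_y == y:
            break  # no change in either vector: converged
        x, y = new_x, new_y
    return x, y


def hamming_error(a: Matrix, x: List[int], y: List[int]) -> int:
    """Count the entries where `a` differs from the outer product of x and y."""
    return sum(
        a[i][j] != x[i] * y[j]
        for i in range(len(a))
        for j in range(len(a[0]))
    )


if __name__ == "__main__":
    data = [
        [1, 1, 1, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
    ]
    x, y = rank_one_binary(data)
    print("rows with pattern:", x)
    print("pattern:", y)
    print("mismatched entries:", hamming_error(data, x, y))
```

In this sketch, the rows selected by x could be replaced by the single pattern y, and the Hamming error gives the cost of that replacement. A bound on that error would be the natural place to decide whether to accept a pattern or split the rows further.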