A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
SIAM Journal on Scientific Computing
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
ROCK: a robust clustering algorithm for categorical attributes
Information Systems
Algorithm 805: computation and uses of the semidiscrete matrix decomposition
ACM Transactions on Mathematical Software (TOMS)
Algorithm 457: finding all cliques of an undirected graph
Communications of the ACM
The Centroid Decomposition: Relationships between Discrete Variational Decompositions and SVDs
SIAM Journal on Matrix Analysis and Applications
Principal Direction Divisive Partitioning
Data Mining and Knowledge Discovery
A Survey of Methods for Scaling Up Inductive Algorithms
Data Mining and Knowledge Discovery
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Clustering categorical data: an approach based on dynamical systems
The VLDB Journal — The International Journal on Very Large Data Bases
Evaluation of sampling for data mining of association rules
RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
PROXIMUS: a framework for analyzing very high dimensional discrete-attributed datasets
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
The maximum edge biclique problem is NP-complete
Discrete Applied Mathematics
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering
Data Mining and Knowledge Discovery
Projective clustering using itemset discovery for multi-dimensional data analysis
MS'06 Proceedings of the 17th IASTED international conference on Modelling and simulation
Expert Systems with Applications: An International Journal
Semantic indexing in structured peer-to-peer networks
Journal of Parallel and Distributed Computing
An approach to mining bundled commodities
Knowledge-Based Systems
Interactive mining of frequent itemsets over arbitrary time intervals in a data stream
ADC '08 Proceedings of the nineteenth conference on Australasian database - Volume 75
Mining images using clustering and data compressing techniques
International Journal of Information and Communication Technology
Mining discrete patterns via binary matrix factorization
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Data Mining and Knowledge Discovery
PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
Variable support mining of frequent itemsets over data streams using synopsis vectors
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Clustering of heterogeneously typed data with soft computing - a case study
MICAI'11 Proceedings of the 10th international conference on Artificial Intelligence: advances in Soft Computing - Volume Part II
Fast parameterless density-based clustering via random projections
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Hi-index | 0.00 |
This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.