Summarizing transactional databases with overlapped hyperrectangles

Authors:
Yang Xiang; Ruoming Jin; David Fuhry; Feodor F. Dragan
Affiliations:
Department of Biomedical Informatics, The Ohio State University, Columbus, USA 43210; Department of Computer Science, Kent State University, Kent, USA 44242; Department of Computer Science and Engineering, The Ohio State University, Columbus, USA 43210; Department of Computer Science, Kent State University, Kent, USA 44242
Venue:
Data Mining and Knowledge Discovery
Year:
2011

Abstract

Transactional data are ubiquitous. Several methods, including frequent itemset mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem: succinctly summarizing transactional databases. Solving this problem requires linking the high-level structure of the database to a potentially huge number of frequent itemsets. We formulate the problem as a set covering problem using overlapped hyperrectangles (a concept referred to as tiles in some existing papers). We prove that this problem and several of its variations are NP-hard, and we reveal its relationship to the compact representation of a directed bipartite graph. We develop an approximation algorithm, Hyper, which achieves a logarithmic approximation ratio in polynomial time. We propose a pruning strategy that significantly speeds up the algorithm, and an efficient algorithm, Hyper+, that further summarizes the set of hyperrectangles by allowing false positives. Additionally, we show that the hyperrectangles generated by our algorithms can be properly visualized. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.
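
To make the set covering formulation concrete, the following minimal Python sketch applies the classical greedy heuristic for weighted set cover to a toy database. It is an illustration under stated assumptions, not the Hyper algorithm itself: the abstract does not give the cost model, so the sketch assumes a hyperrectangle (a set of transactions paired with a set of items) costs the number of transactions plus the number of items, and it assumes the candidate hyperrectangles are supplied as input. The names `cells`, `greedy_cover`, `db`, and `candidates` are hypothetical.

```python
from itertools import product


def cells(rect):
    """Return the (transaction, item) cells covered by a hyperrectangle."""
    transactions, items = rect
    return set(product(transactions, items))


def greedy_cover(db, candidates):
    """Greedy weighted set cover over candidate hyperrectangles.

    db: dict mapping transaction id -> set of items.
    candidates: list of (frozenset of transaction ids, frozenset of items).
    Assumed cost of a hyperrectangle: len(transactions) + len(items).
    """
    all_cells = {(t, i) for t, items in db.items() for i in items}
    # Keep only candidates with no false positives (every cell is in db).
    valid = [r for r in candidates if cells(r) <= all_cells]

    uncovered = set(all_cells)
    chosen = []
    while uncovered:
        best, best_price = None, None
        for rect in valid:
            gain = len(cells(rect) & uncovered)
            if gain == 0:
                continue
            # Price = assumed cost per newly covered cell.
            price = (len(rect[0]) + len(rect[1])) / gain
            if best is None or price < best_price:
                best, best_price = rect, price
        if best is None:
            raise ValueError("candidates cannot cover the database")
        chosen.append(best)
        uncovered -= cells(best)
    return chosen


# Toy database: three transactions over items a, b, c.
db = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"b", "c"}}
candidates = [
    (frozenset({1, 2}), frozenset({"a", "b"})),
    (frozenset({1, 2, 3}), frozenset({"b"})),
    (frozenset({3}), frozenset({"b", "c"})),
    (frozenset({1}), frozenset({"a", "b"})),
    (frozenset({2}), frozenset({"a", "b"})),
]
summary = greedy_cover(db, candidates)
total_cost = sum(len(t) + len(i) for t, i in summary)
```

On this toy input the greedy pass first selects the hyperrectangle over transactions {1, 2} and items {a, b}, then the one over transaction {3} and items {b, c}, for a total assumed cost of 7. Selected hyperrectangles are allowed to overlap, since a candidate is scored only by the cells it newly covers. The greedy heuristic is known to give a logarithmic approximation ratio relative to the best cover drawn from the supplied candidates.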