We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. Specifically, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, to querying of samples of the original data, and to other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models can handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
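To make the contrast between the independence baseline and the inclusion-exclusion idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy data set of eight transactions, stores itemset frequencies in a plain dictionary rather than an ADtree, and keeps every itemset up to a fixed size instead of only the frequent ones. All function names (`itemset_frequencies`, `independence_estimate`, `inclusion_exclusion_estimate`) are hypothetical. The query asks for the fraction of transactions that contain items `a` and `b` but not `c`.

```python
from itertools import combinations


def itemset_frequencies(transactions, max_size):
    """Relative frequency of every itemset with at most max_size items.

    Assumption: no support threshold is applied, so every itemset is kept.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    freqs = {frozenset(): 1.0}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            freqs[s] = sum(1 for t in transactions if s <= t) / n
    return freqs


def independence_estimate(freqs, positive, negative):
    """Baseline: multiply single-item marginals as if items were independent."""
    est = 1.0
    for item in positive:
        est *= freqs[frozenset([item])]
    for item in negative:
        est *= 1.0 - freqs[frozenset([item])]
    return est


def inclusion_exclusion_estimate(freqs, positive, negative):
    """P(all positive items present, all negative items absent).

    Sums (-1)^|S| * f(positive | S) over every subset S of the negative items.
    """
    base = frozenset(positive)
    negative = list(negative)
    total = 0.0
    for k in range(len(negative) + 1):
        for subset in combinations(negative, k):
            total += (-1) ** k * freqs[base | frozenset(subset)]
    return total


# Toy transaction data (an assumption for illustration only).
transactions = [frozenset(t) for t in (
    {"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"},
    {"a", "b"}, {"b", "c"}, set(), {"a", "b"},
)]
freqs = itemset_frequencies(transactions, max_size=3)

# Ground truth by scanning the data directly.
exact = sum(
    1 for t in transactions if {"a", "b"} <= t and "c" not in t
) / len(transactions)
ie = inclusion_exclusion_estimate(freqs, ["a", "b"], ["c"])
indep = independence_estimate(freqs, ["a", "b"], ["c"])

print("exact:", exact)
print("inclusion-exclusion:", ie)
print("independence:", indep)
```

Because this sketch stores every itemset the query needs, the inclusion-exclusion answer matches the direct scan, while the independence estimate does not. When only frequent itemsets are available, some terms of the sum would be missing and the answer becomes an approximation; the sketch does not handle that case.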