Probabilistic query models for transaction data

Authors:
Dmitry Pavlov; Padhraic Smyth
Affiliations:
Information and Computer Science, University of California, Irvine, Irvine, CA; Information and Computer Science, University of California, Irvine, Irvine, CA
Venue:
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2001

Abstract

We investigate the application of Bayesian networks, Markov random fields, and mixture models to the problem of query answering for transaction data sets. We formulate two versions of the querying problem: query selectivity estimation (i.e., finding exact counts for tuples in a data set) and query generalization (i.e., computing the probability that a tuple will occur in new data). We show that frequent itemsets are useful for reducing the original data to a compressed representation, and we introduce a method to store them using an ADTree data structure. Extending our earlier work on this topic, we propose several new schemes for query answering that are based on the compressed representation and avoid direct scans of the data at query time. Experimental results on real-world transaction data sets provide insights into the tradeoffs among the offline time for model building, the online time for query answering, the memory footprint of the compressed data, and the accuracy of the estimates returned for a query.
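
To make the selectivity estimation setting concrete, the sketch below compares an exact count obtained by scanning the data with an estimate computed from stored per-item marginals alone. It uses a plain independence model as a baseline, which is an assumption made here for illustration and is not one of the models proposed in the paper; the function names, the toy transactions, and the query encoding (item mapped to 1 for present, 0 for absent) are likewise hypothetical.

```python
from typing import Dict, List, Set


def exact_count(transactions: List[Set[str]], query: Dict[str, int]) -> int:
    """Count transactions that satisfy every (item, value) constraint by a full scan."""
    return sum(
        all((item in t) == bool(value) for item, value in query.items())
        for t in transactions
    )


def fit_marginals(transactions: List[Set[str]]) -> Dict[str, float]:
    """Offline step: store the fraction of transactions containing each item."""
    n = len(transactions)
    counts: Dict[str, int] = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {item: c / n for item, c in counts.items()}


def independence_estimate(
    marginals: Dict[str, float], n: int, query: Dict[str, int]
) -> float:
    """Online step: estimate the count from marginals only, assuming items are independent."""
    prob = 1.0
    for item, value in query.items():
        p = marginals.get(item, 0.0)  # unseen items get probability 0
        prob *= p if value else (1.0 - p)
    return n * prob


# Hypothetical toy data: each transaction is the set of items it contains.
transactions = [
    {"bread", "milk"},
    {"bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
query = {"bread": 1, "milk": 1}  # conjunctive query: bread AND milk

marginals = fit_marginals(transactions)
exact = exact_count(transactions, query)
estimate = independence_estimate(marginals, len(transactions), query)
print(exact, estimate)
```

The gap between the two numbers comes from ignoring dependence between items. The online step never touches the transactions, which mirrors the offline/online split discussed in the abstract.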