Probabilistic models for query approximation with large sparse binary data sets

Authors:
Dmitry Pavlov;Heikki Mannila;Padhraic Smyth
Affiliations:
Information and Computer Science, University of California, Irvine, CA;Nokia Research Center, Finland;Information and Computer Science, University of California, Irvine, CA
Venue:
UAI'00 Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence
Year:
2000


Abstract

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint. To alleviate the computational requirements we show how one can apply bucket elimination and clique tree approaches to take advantage of structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
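As an illustration of the selectivity estimation problem described above, the following is a minimal sketch of the independence model, the simplest of the baselines the abstract mentions. It is not the authors' implementation: the function names (`fit_marginals`, `estimate_count`, `exact_count`), the toy transactions, and the restriction to conjunctive queries over item presence or absence are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def fit_marginals(rows: List[Set[str]], items: Set[str]) -> Dict[str, float]:
    """Estimate P(item = 1) for each item as its relative frequency."""
    n = len(rows)
    return {item: sum(item in row for row in rows) / n for item in items}


def estimate_count(marginals: Dict[str, float], n_rows: int,
                   query: Dict[str, bool]) -> float:
    """Independence-model estimate of the number of rows matching a
    conjunctive query. `query` maps an item to the required value
    (True = present, False = absent)."""
    p = 1.0
    for item, present in query.items():
        m = marginals[item]
        p *= m if present else 1.0 - m
    return n_rows * p


def exact_count(rows: List[Set[str]], query: Dict[str, bool]) -> int:
    """Ground truth obtained by scanning every row (the cost the model avoids)."""
    return sum(
        all((item in row) == present for item, present in query.items())
        for row in rows
    )


# Hypothetical toy data: each row is the set of items present in a transaction.
rows = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]
items = {"milk", "bread", "eggs"}
marginals = fit_marginals(rows, items)

query = {"milk": True, "bread": True}
print(estimate_count(marginals, len(rows), query))  # model estimate
print(exact_count(rows, query))                     # true count from a full scan
```

The estimate needs only one stored number per attribute, which is why it is cheap, but it ignores all dependence between attributes. The Chow-Liu tree and MRF models studied in the paper trade additional memory and computation for estimates that account for such dependence.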