Large, sparse sets of binary transaction data with millions of records and thousands of attributes occur in many domains: customers purchasing products, users visiting web pages, and documents containing words are three typical examples. Real-time query selectivity estimation (estimating the number of rows in the data that satisfy a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive in both computation time and memory. To reduce the computational cost, we show how bucket elimination and clique tree approaches can exploit structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
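To make the comparison concrete, the following is a minimal sketch, not the authors' implementation, of selectivity estimation for a conjunctive query on binary rows. It contrasts the independence model with a maximum entropy estimate fitted by iterative proportional fitting to itemset frequencies. The function names, the tuple-based row format, and the choice to build the joint distribution only over the query's own columns are assumptions made for illustration; the sketch enumerates all 2^k states of the k query columns and does not attempt bucket elimination or clique trees.

```python
from itertools import product


def true_count(rows, query):
    """Exact number of rows matching a conjunctive query {column: 0 or 1}."""
    return sum(all(r[c] == v for c, v in query.items()) for r in rows)


def independence_estimate(rows, query):
    """Estimated count when every attribute is treated as independent."""
    n = len(rows)
    p = 1.0
    for c, v in query.items():
        freq = sum(r[c] for r in rows) / n  # marginal frequency of column c
        p *= freq if v == 1 else 1.0 - freq
    return n * p


def maxent_estimate(rows, query, itemsets, sweeps=200):
    """Estimated count from a maximum entropy distribution over the query
    columns, constrained to match the frequencies of the given itemsets.
    Fitted by iterative proportional fitting (hypothetical helper)."""
    n = len(rows)
    cols = sorted(query)
    # Constraints: every query column on its own, plus each supplied
    # itemset (size > 1) that lies entirely inside the query columns.
    constraints = [(c,) for c in cols]
    constraints += [tuple(s) for s in itemsets
                    if len(s) > 1 and set(s) <= set(cols)]
    targets = [sum(all(r[c] for c in s) for r in rows) / n
               for s in constraints]

    states = list(product((0, 1), repeat=len(cols)))
    pos = {c: i for i, c in enumerate(cols)}
    prob = {s: 1.0 / len(states) for s in states}  # uniform starting point

    for _ in range(sweeps):
        for items, f in zip(constraints, targets):
            # States in which every item of this itemset is 1.
            hit = {s for s in states if all(s[pos[c]] for c in items)}
            m = sum(prob[s] for s in hit)
            if not 0.0 < m < 1.0:
                continue  # degenerate mass; skipped in this sketch
            for s in states:
                prob[s] *= f / m if s in hit else (1.0 - f) / (1.0 - m)

    return n * prob[tuple(query[c] for c in cols)]


# Toy data: two positively correlated columns, 8 rows.
rows = [(1, 1)] * 3 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 3
query = {0: 1, 1: 1}
print(true_count(rows, query))
print(independence_estimate(rows, query))
print(maxent_estimate(rows, query, itemsets=[(0, 1)]))
```

On this toy data three rows match the query. The independence model estimates two, because each column is 1 in half of the rows, while the maximum entropy estimate approaches three once the frequency of the pair (0, 1) is added as a constraint. Enumerating every state is exponential in the number of query columns, which is the cost that structure-exploiting inference methods aim to avoid.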