Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data

  • Authors:
  • Dmitry Pavlov;Heikki Mannila;Padhraic Smyth

  • Affiliations:
  • -;-;-

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high-dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade offs between approximation error, model complexity, and the online time required to compute a query answer.