On the effectiveness and efficiency of computing bounds on the support of item-sets in the frequent item-sets mining problem

  • Authors:
  • Bassem Sayrafi; Dirk Van Gucht; Paul W. Purdom

  • Affiliations:
  • Indiana University, Bloomington, IN; Indiana University, Bloomington, IN; Indiana University, Bloomington, IN

  • Venue:
  • Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations
  • Year:
  • 2005

Abstract

We study the relative effectiveness and the efficiency of computing support-bounding rules that can be used to prune the search space in algorithms for the frequent item-sets mining problem (FIM). We develop a formalism wherein these rules can be stated and analyzed using the concepts of differentials and density functions of the support function. We derive a general bounding theorem, which provides lower and upper bounds on the supports of item-sets in terms of the supports of their subsets. Since, in general, many lower and upper bounds exist for the support of an item-set, we show how to determine the best bounds. The result of this optimization shows that the best bounds are among those that involve the supports of all the strict subsets of an item-set of a particular size q. These bounds are determined on the basis of so-called q-rules. In this way, we derive the bounding theorem established by Calders [5]. For these types of bounds, we consider how they compare relative to each other, and in so doing determine the best bounds. Since determining these bounds is combinatorially expensive, we study heuristics that efficiently produce bounds that are usually the best. These heuristics always produce the best bounds on the support of item-sets for basket databases that satisfy independence properties. In particular, we show that for an item-set I, determining which bounds to compute to obtain the best lower and upper bounds on freq(I) can be done in time O(|I|). Even though, in practice, basket databases do not have these independence properties, we argue that our analysis carries over to a much larger class of basket databases where local "near" independence holds. Finally, we conduct an experimental study using real basket databases, in which we compute upper bounds in the context of generalizing the Apriori algorithm. Both the analysis and the study confirm that q-rules with q odd and larger than 1 will almost always do better than the 1-rule (the Apriori rule) on large, dense basket databases. Our experiments reveal that on these basket databases, the 3-rule prunes almost 100% of the search space, while the 1-rule prunes 96% of the search space in the early stages of the algorithm. We also observe a reduction in wasted effort when applying the 3-rule to sparse basket databases. In addition, we give experimental evidence that the combined use of the lower and upper bounds determines the exact support of many frequent item-sets without counting.
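
As a concrete illustration of the kind of subset-based bounds the paper studies, the sketch below (Python, not from the paper) computes two of the simplest ones for an item-set I: the 1-rule (Apriori) upper bound supp(I) <= min over a in I of supp(I \ {a}), and an inclusion-exclusion lower bound supp(I) >= supp(I \ {a}) + supp(I \ {b}) - supp(I \ {a, b}) for distinct a, b in I. The toy basket database, function names, and data layout are illustrative assumptions; the paper's general bounding theorem and its q-rules cover a much richer family of such bounds.

```python
from itertools import combinations

def support(itemset, baskets):
    """Exact support: number of baskets containing every item in itemset."""
    s = frozenset(itemset)
    return sum(1 for b in baskets if s <= b)

def apriori_upper_bound(itemset, supp):
    """1-rule (Apriori) upper bound: supp(I) <= min over a in I of supp(I \\ {a})."""
    I = frozenset(itemset)
    return min(supp[I - {a}] for a in I)

def pairwise_lower_bound(itemset, supp):
    """Inclusion-exclusion lower bound:
    supp(I) >= supp(I \\ {a}) + supp(I \\ {b}) - supp(I \\ {a, b}) for a != b in I."""
    I = frozenset(itemset)
    best = 0  # support is always >= 0
    for a, b in combinations(I, 2):
        lb = supp[I - {a}] + supp[I - {b}] - supp[I - {a, b}]
        best = max(best, lb)
    return best

if __name__ == "__main__":
    # Toy basket database (an illustrative assumption, not from the paper).
    baskets = [frozenset(b) for b in
               [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
    I = frozenset({"a", "b", "c"})
    # Precompute the supports of all strict subsets of I (what the bounds consume).
    supp = {frozenset(J): support(J, baskets)
            for k in range(len(I)) for J in combinations(I, k)}
    print("upper bound:", apriori_upper_bound(I, supp))   # from the 1-rule
    print("lower bound:", pairwise_lower_bound(I, supp))  # from a two-subset rule
    print("exact:      ", support(I, baskets))            # lower bound is tight here
```

In the level-wise setting the paper targets (generalizing Apriori), the supports of the strict subsets of I are already known from earlier passes, so evaluating such bounds costs only lookups rather than additional database scans; that is what makes pruning with them attractive.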