A measure-theoretic framework for constraints and bounds on measurements of data

  • Authors:
  • Dirk Van Gucht;Bassem Sayrafi

  • Affiliations:
  • Indiana University;Indiana University

  • Venue:
  • A measure-theoretic framework for constraints and bounds on measurements of data
  • Year:
  • 2005

Abstract

We propose a mathematical framework for measures to study constraints and bounds on measurements of data used in databases and data mining. The framework is significant in that it facilitates two tools for knowledge extraction and paves the way for supporting applications involving measurements in other areas of computing, such as the theory of reasoning about uncertainty in machine learning.
The first tool identifies structure present in data by means of measure constraints. This leads naturally to implication problems for such constraints and, in turn, to inference rule systems and their associated soundness and completeness. We use this tool to study data dependencies in relational databases and to study disjunctive rules, which are used to determine more compact representations of frequent itemsets in the frequent itemset mining problem. We also make connections to propositional logic, which allow us to prove completeness of a fragment of propositional logic and to provide complexity results for the implication problem of measure constraints.
The second tool consists of bounding rules that permit reasoning about bounds on measurements of data in terms of measures of related data. This leads to a general measure bounding theorem, which is the fundamental tool for reasoning about these bounds. We apply this theorem to the frequent itemset mining problem. In that context, we investigate the bounding theorem theoretically under independence assumptions on the database, derive the best bounds for the independent case, and show that these bounds can be computed efficiently. We then report on an experimental evaluation that tests the adequacy of these bounds on real-world datasets.
The key rule used in current frequent itemset mining algorithms, the anti-monotonicity rule, is known to provide good bounds on the support of itemsets. We find that, for dense datasets, rules other than the anti-monotonicity rule provide tighter bounds on the support of itemsets. This leads to improved mining algorithms for this problem.
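To make the contrast between the anti-monotonicity rule and other bounding rules concrete, here is a minimal Python sketch. It is an illustration only, not the general measure bounding theorem from this work: the lower bound used is the standard inclusion-exclusion bound for a pair of items, and the function names and the toy dataset are hypothetical, chosen for this example.

```python
def support(transactions, itemset):
    """Count the transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def anti_monotone_upper_bound(transactions, itemset):
    """Anti-monotonicity: supp(X) is at most the smallest support
    among the subsets of X obtained by dropping one item."""
    items = frozenset(itemset)
    if not items:
        return len(transactions)
    return min(support(transactions, items - {i}) for i in items)


def pair_lower_bound(transactions, a, b):
    """Inclusion-exclusion lower bound for a pair (assumed here as an
    example of a non-anti-monotonicity rule):
    supp({a, b}) >= supp({a}) + supp({b}) - N."""
    n = len(transactions)
    return max(0, support(transactions, {a}) + support(transactions, {b}) - n)


# Hypothetical dense toy dataset: most items occur in most transactions.
transactions = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("abc"),
    frozenset("ac"),
    frozenset("bc"),
    frozenset("abc"),
]

actual = support(transactions, "ab")
upper = anti_monotone_upper_bound(transactions, "ab")
lower = pair_lower_bound(transactions, "a", "b")
print(f"{lower} <= supp(ab) = {actual} <= {upper}")
```

On this toy dataset the lower bound coincides with the actual support of {a, b}, while the anti-monotonicity upper bound overshoots by one. This matches the qualitative claim that, for dense data, rules other than anti-monotonicity can be tighter; it says nothing about how the bounds behave on real datasets.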