Quantifiable data mining using ratio rules

  • Authors:
  • Flip Korn;Alexandros Labrinidis;Yannis Kotidis;Christos Faloutsos

  • Affiliations:
  • AT&T Labs - Research, Florham Park, NJ 07932, USA (E-mail: flip@research.att.com); University of Maryland, College Park, MD 20742, USA (E-mail: {labrinid,kotidis}@cs.umd.edu); University of Maryland, College Park, MD 20742, USA (E-mail: {labrinid,kotidis}@cs.umd.edu); Carnegie Mellon University, Pittsburgh, PA 15213, USA (E-mail: christos@cs.cmu.edu)

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2000

Abstract

Association Rule Mining algorithms operate on a data matrix (e.g., customers $\times$ products) to derive association rules [AIS93b, SA96]. We propose a new paradigm, namely, Ratio Rules, which are quantifiable in that we can measure the “goodness” of a set of discovered rules. As the measure of “goodness” we propose the “guessing error”, that is, the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. Another contribution is a novel method to guess missing/hidden values from the Ratio Rules that our method derives. For example, if somebody bought \$10 of milk and \$3 of bread, our rules can “guess” the amount spent on butter. Thus, unlike association rules, Ratio Rules can perform a variety of important tasks such as forecasting, answering “what-if” scenarios, detecting outliers, and visualizing the data. Moreover, we show that we can compute Ratio Rules in a single pass over the data set with small memory requirements (a few small matrices), in contrast to association rule mining methods, which require multiple passes and/or large memory. Experiments on several real data sets (e.g., basketball and baseball statistics, biological data) demonstrate that the proposed method: (a) leads to rules that make sense; (b) can find large itemsets in binary matrices, even in the presence of noise; and (c) consistently achieves a “guessing error” up to 5 times lower than that obtained using straightforward column averages.
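
The ideas in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: the rules are taken to be the top principal directions of the column-centered data matrix, hidden cells are guessed by a least-squares fit on the known cells, and the guessing error is computed by hiding one cell at a time. The function names (`fit_ratio_rules`, `guess_hidden`, `guessing_error`, `column_average_error`) and the synthetic milk/bread/butter data are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def fit_ratio_rules(rows, k):
    """One pass over the rows, keeping only a sum vector and a small M x M matrix.

    Assumption: the k rules are the top-k eigenvectors of the covariance matrix.
    """
    rows = iter(rows)
    first = np.asarray(next(rows), dtype=float)
    s = first.copy()                    # running column sums
    S = np.outer(first, first)          # running sum of outer products
    n = 1
    for row in rows:
        row = np.asarray(row, dtype=float)
        s += row
        S += np.outer(row, row)
        n += 1
    mean = s / n
    cov = S / n - np.outer(mean, mean)  # covariance from the accumulated sums
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues come back in ascending order
    V = eigvecs[:, ::-1][:, :k]         # keep the k strongest directions
    return mean, V

def guess_hidden(row, hidden, mean, V):
    """Fill the cells listed in `hidden` using the known cells and the rules."""
    row = np.asarray(row, dtype=float)
    known = [j for j in range(row.size) if j not in hidden]
    # Least-squares coefficients c such that V[known] @ c ~ row[known] - mean[known]
    c, *_ = np.linalg.lstsq(V[known], row[known] - mean[known], rcond=None)
    filled = row.copy()
    filled[hidden] = mean[hidden] + V[hidden] @ c
    return filled

def guessing_error(X_train, X_test, k):
    """RMS error when each test cell is hidden in turn and then guessed."""
    mean, V = fit_ratio_rules(X_train, k)
    sq = [(guess_hidden(row, [j], mean, V)[j] - row[j]) ** 2
          for row in X_test for j in range(row.size)]
    return float(np.sqrt(np.mean(sq)))

def column_average_error(X_train, X_test):
    """Baseline: guess every hidden cell with its training column average."""
    mean = X_train.mean(axis=0)
    return float(np.sqrt(np.mean((X_test - mean) ** 2)))

# Synthetic data (an assumption for illustration): spending on milk, bread and
# butter roughly in the ratio 10 : 3 : 2, scaled per customer, plus small noise.
rng = np.random.default_rng(0)
scale = rng.uniform(0.5, 2.0, size=(600, 1))
X = scale * np.array([10.0, 3.0, 2.0]) + rng.normal(0.0, 0.1, size=(600, 3))
X_train, X_test = X[:500], X[500:]

rr_err = guessing_error(X_train, X_test, k=1)
avg_err = column_average_error(X_train, X_test)
mean, V = fit_ratio_rules(X_train, k=1)
butter = guess_hidden([10.0, 3.0, 0.0], [2], mean, V)[2]  # butter cell is hidden
print(f"ratio-rule error: {rr_err:.3f}, column-average error: {avg_err:.3f}")
print(f"guessed butter spend for milk=10, bread=3: {butter:.2f}")
```

On data that really does follow a dominant ratio, the rule-based guess should beat the column-average baseline; how large the gap is on real data depends on how well a few ratios describe the matrix.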