Empirical Bayes screening for multi-item associations

Authors:
William DuMouchel; Daryl Pregibon
Affiliations:
AT&T Labs-Research, Florham Park, NJ; AT&T Labs-Research, Florham Park, NJ
Venue:
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2001



Abstract

This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each item set divided by a baseline frequency computed as if items occurred independently. The focus is on obtaining reliable estimates of this measure of interestingness for all item sets, even item sets with relatively low frequencies. For example, in a medical database of patient histories, unusual item sets including the item "patient death" (or other serious adverse event) should ideally be flagged with as few as 5 or 10 occurrences of the item set; it is unacceptable to require that item sets occur in as many as 0.1% of millions of patient reports before the data mining algorithm detects a signal. Similar considerations apply in fraud detection applications. Thus we abandon the requirement that interesting item sets must meet a relatively large fixed minimum support, and adopt a criterion based on the results of fitting an empirical Bayes model to the item set counts. The model allows us to define a 95% Bayesian lower confidence limit for the "interestingness" measure of every item set, whereupon the item sets can be ranked according to their empirical Bayes confidence limits. For item sets of size J > 2, we also distinguish between multi-item associations that can be explained by the observed J(J-1)/2 pairwise associations, and item sets that are significantly more frequent than their pairwise associations would suggest. Such item sets can uncover complex or synergistic mechanisms generating multi-item associations. This methodology has been applied within the U.S. Food and Drug Administration (FDA) to databases of adverse drug reaction reports and within AT&T to customer international calling histories. We also present graphical techniques for exploring and understanding the modeling results.
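
The raw version of the interestingness measure described in the abstract, observed item set frequency divided by the frequency expected if items occurred independently, can be sketched as follows. This is only an illustration of that ratio, not the paper's empirical Bayes estimator: the function name `lift_ratios` and the toy transactions are assumptions made up for this example, and no shrinkage or confidence limit is computed.

```python
from collections import Counter
from itertools import combinations


def lift_ratios(transactions, size=2):
    """Return {itemset: (observed, expected, ratio)} for item sets of a given size.

    `expected` is the count the item set would have if its items occurred
    independently: n * product of the items' marginal frequencies.
    (Hypothetical helper, not taken from the paper.)
    """
    n = len(transactions)
    item_counts = Counter()
    set_counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # de-duplicate and fix an item order
        item_counts.update(items)
        set_counts.update(combinations(items, size))

    results = {}
    for itemset, observed in set_counts.items():
        expected = n
        for item in itemset:
            expected *= item_counts[item] / n
        results[itemset] = (observed, expected, observed / expected)
    return results


# Toy data invented for illustration only.
transactions = [
    ["drug_a", "rash"],
    ["drug_a", "rash"],
    ["drug_a", "nausea"],
    ["drug_b", "nausea"],
    ["drug_b", "headache"],
    ["drug_b", "nausea"],
]
ratios = lift_ratios(transactions)
for itemset, (obs, exp, ratio) in sorted(ratios.items()):
    print(itemset, obs, round(exp, 2), round(ratio, 2))
```

In this toy data the pair ("drug_b", "headache") reaches the same ratio as ("drug_a", "rash") on the strength of a single occurrence. Raw ratios built from such small counts are unstable, which is the motivation the abstract gives for ranking item sets by an empirical Bayes lower confidence limit rather than by the ratio itself.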