Leakage in data mining: formulation, detection, and avoidance

Authors:
Shachar Kaufman; Saharon Rosset; Claudia Perlich
Affiliations:
Tel-Aviv University, Tel-Aviv, Israel; Tel-Aviv University, Tel-Aviv, Israel; Media6Degrees, New York, NY, USA
Venue:
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Year:
2011

Abstract

Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target that should not legitimately be available to mine from. Beyond our own industry experience with real-life projects, controversies around several recent major public data mining competitions, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, show that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, the existing literature has largely left the idea unexplored. What little has been said is not broad enough to cover the more complex cases of leakage that have recently been documented, such as those where the classical i.i.d. assumption is violated. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive a general methodology for dealing with the issue. We show that leakage can be avoided with a simple, specific approach to data management followed by what we call a learn-predict separation, and we present several ways of detecting leakage when the modeler has no control over how the data have been collected.
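
As a rough illustration of what "information that should not legitimately be available" can mean in practice, the sketch below filters candidate features by the date on which each value became known, keeping only values known before the moment a prediction would be made. This is an assumption made for illustration, not the procedure defined in the paper: the `Observation` record, the `legitimate_features` helper, and the sample customer data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One recorded value for an entity (hypothetical schema)."""
    entity_id: str
    name: str
    value: float
    observed_on: date  # date on which the value became known


def legitimate_features(observations, entity_id, prediction_date):
    """Return values for entity_id known strictly before prediction_date.

    Later observations of the same feature overwrite earlier ones.
    """
    ordered = sorted(observations, key=lambda o: o.observed_on)
    return {
        o.name: o.value
        for o in ordered
        if o.entity_id == entity_id and o.observed_on < prediction_date
    }


# Hypothetical data: the second value was recorded after the prediction date,
# so using it as an input would leak information about the outcome.
observations = [
    Observation("cust-1", "visits_last_month", 4.0, date(2011, 1, 31)),
    Observation("cust-1", "refund_issued", 1.0, date(2011, 3, 15)),
]

features = legitimate_features(observations, "cust-1", date(2011, 2, 1))
print(features)
```

Filtering on a per-value timestamp is only one possible way to make the idea concrete; it assumes such timestamps exist and are trustworthy, which may not hold when the modeler has no control over how the data were collected.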