When and how to subsample: report on the KDD-2001 panel

Author: Pedro Domingos
Affiliation: University of Washington, Seattle, WA
Venue: ACM SIGKDD Explorations Newsletter
Year: 2002

Abstract

Databases in the terabyte range are now common. In many domains, mining all the data available in reasonable time is already beyond the reach of current systems. Yet the size of databases continues to grow rapidly. Is subsampling unavoidable? Or should it be avoided at all costs? If we subsample, what is the best way to do it? What issues must be taken into account? The KDD-2001 Panel on When and How to Subsample addressed these and related questions, with the twin goals of developing practical guidelines and identifying key research issues. It was chaired by Pedro Domingos (University of Washington), and the participants were Surajit Chaudhuri (Microsoft Research), David Jensen (University of Massachusetts at Amherst), Ronny Kohavi (Blue Martini), and Foster Provost (New York University). Below is each panelist's summary of his position.