When and how to subsample: report on the KDD-2001 panel

  • Authors: Pedro Domingos
  • Affiliations: University of Washington, Seattle, WA
  • Venue: ACM SIGKDD Explorations Newsletter
  • Year: 2002

Abstract

Databases in the terabyte range are now common. In many domains, mining all the data available in reasonable time is already beyond the reach of current systems. Yet the size of databases continues to grow rapidly. Is subsampling unavoidable? Or should it be avoided at all costs? If we subsample, what is the best way to do it? What issues must be taken into account? The KDD-2001 Panel on When and How to Subsample addressed these and related questions, with the twin goals of developing practical guidelines and identifying key research issues. It was chaired by Pedro Domingos (University of Washington), and the participants were Surajit Chaudhuri (Microsoft Research), David Jensen (University of Massachusetts at Amherst), Ronny Kohavi (Blue Martini), and Foster Provost (New York University). Below is each panelist's summary of his position.