Bump hunting in high-dimensional data

  • Authors:
  • Jerome H. Friedman; Nicholas I. Fisher

  • Affiliations:
  • Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 (jhf@stat.stanford.edu); CSIRO Mathematical & Information Sciences, Locked Bag 17, North Ryde, NSW 2113, Australia (Nick.Fisher@cmis.CSIRO.AU)

  • Venue:
  • Statistics and Computing
  • Year:
  • 1999

Abstract

Many data analytic questions can be formulated as (noisy) optimization problems. They explicitly or implicitly involve finding simultaneous combinations of values for a set of ("input") variables that imply unusually large (or small) values of another designated ("output") variable. Specifically, one seeks a set of subregions of the input variable space within which the value of the output variable is considerably larger (or smaller) than its average value over the entire input domain. In addition, it is usually desired that these regions be describable in an interpretable form involving simple statements ("rules") concerning the input values. This paper presents a procedure directed towards this goal based on the notion of "patient" rule induction. This patient strategy is contrasted with the greedy ones used by most rule induction methods, and semi-greedy ones used by some partitioning tree techniques such as CART. Applications involving scientific and commercial databases are presented.
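
The abstract does not spell out the procedure, so the following is only a minimal sketch of one way a "patient" search could look, assuming a top-down peeling scheme: start from a box covering all the data and repeatedly trim a small fraction of points from one face, keeping whichever trim gives the highest mean output among the remaining points. The names `peel_box`, `alpha`, and `min_support`, their default values, and the stopping rule are all hypothetical choices made for illustration; this is not the authors' implementation.

```python
import math


def peel_box(X, y, alpha=0.1, min_support=0.1):
    """Simplified top-down peeling: shrink a box to raise the mean of y inside it.

    X: list of rows (each a list of floats); y: list of floats.
    alpha (fraction peeled per step) and min_support (smallest allowed
    fraction of points left in the box) are illustrative assumptions.
    Returns (box, inside): box[j] = (low, high) bounds on input variable j,
    inside = indices of the points that remain in the box.
    """
    n, d = len(X), len(X[0])
    box = [(-math.inf, math.inf)] * d
    inside = list(range(n))

    while len(inside) >= 2:
        m = len(inside)
        k = max(1, int(alpha * m))  # target number of points to peel
        best = None  # (mean, variable index, new bounds, kept indices)
        for j in range(d):
            vals = sorted(X[i][j] for i in inside)
            lo, hi = box[j]
            low_cut, high_cut = vals[k], vals[m - 1 - k]
            # Two candidate peels per variable: trim the low or the high face.
            candidates = [
                ((low_cut, hi), [i for i in inside if X[i][j] >= low_cut]),
                ((lo, high_cut), [i for i in inside if X[i][j] <= high_cut]),
            ]
            for bounds, kept in candidates:
                if len(kept) == m:
                    continue  # tied values: this peel removed nothing
                mean = sum(y[i] for i in kept) / len(kept)
                if best is None or mean > best[0]:
                    best = (mean, j, bounds, kept)
        # Stop when no peel is possible or the box would become too small.
        if best is None or len(best[3]) < min_support * n:
            break
        _, j, bounds, inside = best
        box[j] = bounds
    return box, inside
```

Each step removes only a small fraction of the data, which is what distinguishes this sketch from a greedy split that discards a large part of the data at once. The returned bounds read directly as a conjunction of simple statements of the form `low <= x_j <= high`, matching the interpretable "rules" the abstract asks for.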