Diagnosing extrapolation: tree-based density estimation

  • Authors:
  • Giles Hooker

  • Affiliations:
  • Stanford University, Stanford, CA

  • Venue:
  • Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

There has historically been very little concern with extrapolation in Machine Learning, yet extrapolation can be critical to diagnose. Predictor functions are almost always learned on a set of highly correlated data comprising a very small segment of predictor space. Moreover, flexible predictors, by their very nature, are not controlled at points of extrapolation. This becomes a problem for diagnostic tools that require evaluation on a product distribution. It is also an issue when we are trying to optimize a response over some variable in the input space. Finally, it can be a problem in non-static systems in which the underlying predictor distribution gradually drifts with time or when typographical errors misrecord the values of some predictors.We present a diagnosis for extrapolation as a statistical test for a point originating from the data distribution as opposed to a null hypothesis uniform distribution. This allows us to employ general classification methods for estimating such a test statistic. Further, we observe that CART can be modified to accept an exact distribution as an argument, providing a better classification tool which becomes our extrapolation-detection procedure. We explore some of the advantages of this approach and present examples of its practical application.