Modifying boosted trees to improve performance on task 1 of the 2006 KDD challenge cup

  • Authors: Robert M. Bell; Patrick G. Haffner; Chris Volinsky

  • Affiliations: AT&T Labs-Research, Florham Park, NJ; AT&T Labs-Research, Middletown, NJ; AT&T Labs-Research, Florham Park, NJ

  • Venue: ACM SIGKDD Explorations Newsletter
  • Year: 2006

Abstract

Task 1 of the 2006 KDD Challenge Cup required classification of pulmonary embolisms (PEs) using variables derived from computed tomography angiography. We present our approach to the challenge and justify our choices. We used boosted trees for the main classification task, but modified the algorithm to address idiosyncrasies of the scoring criteria. The two main modifications were (1) changing the dependent variable in the training set to account for multiple PEs per patient, and (2) augmenting the set of predictor variables with neighborhood information. Both modifications measurably improved predictive performance. We also discuss a statistically based method for setting the classification threshold.
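
As a rough illustration of the second modification, the sketch below shows one way neighborhood information could be added to a set of predictor variables before training a classifier. The abstract does not say how the authors defined neighborhoods or which summaries they used, so everything here is an assumption made for illustration: the record fields (`patient`, `xyz`, `intensity`), the distance `radius`, and the two derived features (`n_neighbors`, `neighbor_mean_intensity`) are hypothetical.

```python
from collections import defaultdict
from math import dist


def augment_with_neighbors(candidates, radius=10.0):
    """Return copies of the candidate records with two neighborhood features added.

    Hypothetical scheme, not the authors' method: a neighbor is any other
    candidate from the same patient within `radius` (Euclidean distance).
    """
    # Group candidates by patient so neighbors are only sought within a patient.
    by_patient = defaultdict(list)
    for cand in candidates:
        by_patient[cand["patient"]].append(cand)

    augmented = []
    for cand in candidates:
        neighbors = [
            other
            for other in by_patient[cand["patient"]]
            if other is not cand and dist(cand["xyz"], other["xyz"]) <= radius
        ]
        row = dict(cand)  # copy, so the input records are left unchanged
        row["n_neighbors"] = len(neighbors)
        # Fall back to the candidate's own value when it has no neighbors.
        row["neighbor_mean_intensity"] = (
            sum(other["intensity"] for other in neighbors) / len(neighbors)
            if neighbors
            else cand["intensity"]
        )
        augmented.append(row)
    return augmented


# Toy data (made up): three candidates for patient 1, one for patient 2.
candidates = [
    {"patient": 1, "xyz": (0.0, 0.0, 0.0), "intensity": 0.2},
    {"patient": 1, "xyz": (3.0, 4.0, 0.0), "intensity": 0.6},
    {"patient": 1, "xyz": (50.0, 0.0, 0.0), "intensity": 0.9},
    {"patient": 2, "xyz": (1.0, 0.0, 0.0), "intensity": 0.4},
]
rows = augment_with_neighbors(candidates)
```

The augmented rows could then be passed to any boosted-tree learner in place of the original predictors. Restricting neighbors to the same patient is a design choice of this sketch, motivated by the abstract's mention of multiple PEs per patient.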