Weighted Instance Typicality Search (WITS): A nearest neighbor data reduction algorithm

  • Authors:
  • Brent D. Morring; Tony R. Martinez

  • Affiliations:
  • Computer Science Department, Brigham Young University, Provo, UT 84602, USA. E-mail: morringb@axon.cs.byu.edu; E-mail: martinez@cs.byu.edu

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2004

Abstract

Two disadvantages of the standard nearest neighbor algorithm are 1) it must store all the instances of the training set, thus creating a large memory footprint, and 2) it must search all the instances of the training set to predict the classification of a new query point, thus it is slow at run time. Much work has been done to remedy these shortcomings. This paper presents a new algorithm, WITS (Weighted-Instance Typicality Search), and a modified version, Clustered-WITS (C-WITS), designed to address these issues. Data reduction algorithms address both issues by storing and using only a portion of the available instances. WITS is an incremental data reduction algorithm with O(n^2) complexity, where n is the training set size. WITS uses the concept of Typicality in conjunction with Instance-Weighting to produce minimal nearest neighbor solutions. WITS and C-WITS are compared to three other state-of-the-art data reduction algorithms on ten real-world datasets. WITS achieved the highest average accuracy, showed fewer catastrophic failures, and stored an average of 71% fewer instances than DROP-5, the next most competitive algorithm in terms of accuracy and catastrophic failures. The C-WITS algorithm provides a user-defined parameter that gives the user control over the training-time vs. accuracy balance. This modification makes C-WITS more suitable for large problems, the very problems data reduction algorithms are designed for. On two large problems (10,992 and 20,000 instances), C-WITS stores only a small fraction of the instances (0.88% and 1.95% of the training data) while maintaining generalization accuracies comparable to the best accuracies reported for these problems.
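
The two costs named in the abstract, and the effect of storing only a subset of instances, can be illustrated with a minimal sketch. This is not the WITS algorithm: the abstract does not specify how instances are selected or how weights are applied. The function name `nearest_neighbor`, the toy data, and the rule of multiplying the distance by a per-instance weight are all assumptions made for illustration only.

```python
import math


def nearest_neighbor(query, instances, weights=None):
    """Return the label of the stored instance closest to `query`.

    instances: list of (feature_vector, label) pairs.
    weights:   optional positive per-instance factors. ASSUMPTION: the
               Euclidean distance is multiplied by the weight, so a weight
               below 1 enlarges that instance's region of influence. The
               abstract does not state the weighting rule WITS uses.
    """
    best_label, best_dist = None, math.inf
    # Every stored instance is visited once: cost grows with the stored set.
    for i, (x, label) in enumerate(instances):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, x)))
        if weights is not None:
            d *= weights[i]
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Toy training set (hypothetical data): two well-separated classes.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]

# A reduced solution keeps only a portion of the available instances.
reduced = [train[1], train[3]]

queries = [(0.3, 0.3), (0.8, 0.8)]
full_preds = [nearest_neighbor(q, train) for q in queries]
reduced_preds = [nearest_neighbor(q, reduced) for q in queries]
```

On this toy data the two stored instances give the same predictions as the full set of six, which is the sense in which data reduction cuts both memory use and query time. Passing `weights` lets a single retained instance cover a larger or smaller region, which is one plausible reading of how instance weighting could help a small stored set approximate the full decision boundary.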