Informative sampling for large unbalanced data sets

  • Authors:
  • Zhenyu Lu; Anand I. Rughani; Bruce I. Tranmer; Josh Bongard

  • Affiliations:
  • University of Vermont, Burlington, VT, USA (all authors)

  • Venue:
  • Proceedings of the 10th annual conference companion on Genetic and evolutionary computation
  • Year:
  • 2008

Abstract

Selective sampling is a form of active learning that can reduce the cost of training by drawing only informative data points into the training set. This selected training set is expected to contain more information for modeling than one obtained by random sampling, making modeling faster and more accurate. We introduce a novel approach to selective sampling derived from the Estimation-Exploration Algorithm (EEA). The EEA is a coevolutionary algorithm that uses model disagreement to determine the significance of a training datum, and evolves a set of models only on the selected data. The algorithm in this paper trains a population of Artificial Neural Networks (ANNs) on the training set and uses their disagreement to seek new data for the training set. A medical data set, the National Trauma Data Bank (NTDB), is used to test the algorithm. Experiments show that the algorithm outperforms equivalent algorithms that use randomly selected data or data sampled evenly from each class. Finally, the selected training data reveals which features most affect outcome, allowing for both improved modeling and understanding of the processes that gave rise to the data.
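
The selection loop described above, in which a committee of ANNs repeatedly queries the pool point its members disagree on most, can be sketched roughly as follows. This is a minimal illustration in plain NumPy, not the authors' implementation: the synthetic unbalanced pool, the one-hidden-layer network, the committee size, the disagreement measure (prediction variance), and the query budget are all assumptions made for the example.

```python
# Sketch of disagreement-driven selective sampling with an ANN committee,
# in the spirit of the EEA-based approach described in the abstract.
# All names and hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_pool(n=2000, d=5):
    """Synthetic stand-in for an unbalanced data set (labels hidden until queried)."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 2.0).astype(float)  # minority positive class
    return X, y

def init_net(d, h=8):
    """One-hidden-layer ANN with random initial weights."""
    return {"W1": rng.normal(scale=0.5, size=(d, h)), "b1": np.zeros(h),
            "W2": rng.normal(scale=0.5, size=h), "b2": 0.0}

def forward(net, X):
    H = np.tanh(X @ net["W1"] + net["b1"])
    return 1.0 / (1.0 + np.exp(-(H @ net["W2"] + net["b2"])))

def train(net, X, y, lr=0.1, epochs=200):
    """Plain gradient descent on binary cross-entropy."""
    for _ in range(epochs):
        H = np.tanh(X @ net["W1"] + net["b1"])
        p = 1.0 / (1.0 + np.exp(-(H @ net["W2"] + net["b2"])))
        err = p - y                                   # dLoss/dlogit
        net["W2"] -= lr * H.T @ err / len(y)
        net["b2"] -= lr * err.mean()
        dz1 = np.outer(err, net["W2"]) * (1 - H ** 2)
        net["W1"] -= lr * X.T @ dz1 / len(y)
        net["b1"] -= lr * dz1.mean(axis=0)
    return net

X_pool, y_pool = make_pool()
labeled = list(rng.choice(len(X_pool), size=10, replace=False))  # small seed set

for _ in range(20):                                   # query budget (assumed)
    X_tr, y_tr = X_pool[labeled], y_pool[labeled]
    committee = [train(init_net(X_pool.shape[1]), X_tr, y_tr) for _ in range(5)]
    preds = np.stack([forward(net, X_pool) for net in committee])
    disagreement = preds.var(axis=0)                  # committee variance per candidate
    disagreement[labeled] = -1.0                      # never re-query labeled points
    labeled.append(int(disagreement.argmax()))        # query the most-contested point

print(f"queried {len(labeled)} points; positive fraction in training set:",
      y_pool[labeled].mean())
```

Because the committee tends to disagree most near the decision boundary, the queried set typically over-represents the minority class relative to random sampling, which is one plausible reading of why such selection helps on unbalanced data like the NTDB.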