Informative sampling for large unbalanced data sets

  • Authors:
  • Zhenyu Lu; Anand I. Rughani; Bruce I. Tranmer; Josh Bongard

  • Affiliations:
  • University of Vermont, Burlington, VT, USA (all authors)

  • Venue:
  • Proceedings of the 10th annual conference companion on Genetic and evolutionary computation
  • Year:
  • 2008

Abstract

Selective sampling is a form of active learning that can reduce the cost of training by drawing only informative data points into the training set. This selected training set is expected to contain more information for modeling than one obtained by random sampling, making modeling faster and more accurate. We introduce a novel approach to selective sampling derived from the Estimation-Exploration Algorithm (EEA). The EEA is a coevolutionary algorithm that uses model disagreement to determine the significance of a training datum, and evolves a set of models only on the selected data. The algorithm in this paper trains a population of Artificial Neural Networks (ANNs) on the training set and uses their disagreement to seek new data for the training set. A medical data set, the National Trauma Data Bank (NTDB), is used to test the algorithm. Experiments show that the algorithm outperforms equivalent algorithms that use randomly selected data or data sampled evenly from each class. Finally, the selected training data reveals which features most affect outcome, allowing for both improved modeling and understanding of the processes that gave rise to the data.
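
The selection loop described above, in which a committee of ANNs repeatedly queries the pool point its members disagree on most, can be sketched roughly as follows. This is a minimal illustration in plain NumPy, not the authors' implementation: the synthetic unbalanced pool, the one-hidden-layer network, the committee size, the disagreement measure (prediction variance), and the query budget are all assumptions made for the example.

```python
# Sketch of disagreement-driven selective sampling with an ANN committee,
# in the spirit of the EEA-based approach described in the abstract.
# All names and hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_pool(n=2000, d=5):
    """Synthetic stand-in for an unbalanced data set (labels hidden until queried)."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 2.0).astype(float)  # minority positive class
    return X, y

def init_net(d, h=8):
    """One-hidden-layer ANN with random initial weights."""
    return {"W1": rng.normal(scale=0.5, size=(d, h)), "b1": np.zeros(h),
            "W2": rng.normal(scale=0.5, size=h), "b2": 0.0}

def forward(net, X):
    H = np.tanh(X @ net["W1"] + net["b1"])
    return 1.0 / (1.0 + np.exp(-(H @ net["W2"] + net["b2"])))

def train(net, X, y, lr=0.1, epochs=200):
    """Plain gradient descent on binary cross-entropy."""
    for _ in range(epochs):
        H = np.tanh(X @ net["W1"] + net["b1"])
        p = 1.0 / (1.0 + np.exp(-(H @ net["W2"] + net["b2"])))
        err = p - y                                   # dLoss/dlogit
        net["W2"] -= lr * H.T @ err / len(y)
        net["b2"] -= lr * err.mean()
        dz1 = np.outer(err, net["W2"]) * (1 - H ** 2)
        net["W1"] -= lr * X.T @ dz1 / len(y)
        net["b1"] -= lr * dz1.mean(axis=0)
    return net

X_pool, y_pool = make_pool()
labeled = list(rng.choice(len(X_pool), size=10, replace=False))  # small seed set

for _ in range(20):                                   # query budget (assumed)
    X_tr, y_tr = X_pool[labeled], y_pool[labeled]
    committee = [train(init_net(X_pool.shape[1]), X_tr, y_tr) for _ in range(5)]
    preds = np.stack([forward(net, X_pool) for net in committee])
    disagreement = preds.var(axis=0)                  # committee variance per candidate
    disagreement[labeled] = -1.0                      # never re-query labeled points
    labeled.append(int(disagreement.argmax()))        # query the most-contested point

print(f"queried {len(labeled)} points; positive fraction in training set:",
      y_pool[labeled].mean())
```

Because the committee tends to disagree most near the decision boundary, the queried set typically over-represents the minority class relative to random sampling, which is one plausible reading of why such selection helps on unbalanced data like the NTDB.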