Identifying rare classes with sparse training data

Authors:
Mingwu Zhang;Wei Jiang;Chris Clifton;Sunil Prabhakar
Affiliations:
Department of Computer Science, Purdue University, West Lafayette, IN;Department of Computer Science, Purdue University, West Lafayette, IN;Department of Computer Science, Purdue University, West Lafayette, IN;Department of Computer Science, Purdue University, West Lafayette, IN
Venue:
DEXA'07 Proceedings of the 18th international conference on Database and Expert Systems Applications
Year:
2007

Abstract

Building models and learning patterns from a collection of data are essential tasks for decision making and the dissemination of knowledge. A common way to extract knowledge is to build a classifier. However, when the training dataset is sparse, it is difficult to build an accurate classifier. This is especially true in the biological sciences, where data are hard to produce and error-prone. Through empirical results, this paper shows the challenges of building an accurate classifier from a sparse biological training dataset. Our findings indicate inadequacies in well-known classification techniques. Although certain clustering techniques, such as seeded k-Means, show some promise, there is still room for further improvement. In addition, we propose a novel idea that could be used to produce a more balanced classifier when training samples are very limited.