In most learning algorithms, examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than others, and some may even carry erroneous information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely selection cost, SVM confidence margin, and AdaBoost data weight, for automatically grouping training examples into these three categories. Experimental results on artificial datasets show that, although the three methods are quite different in nature, they yield similar and reasonable categorizations. Results on real-world datasets further demonstrate that treating the three categories differently during learning can improve generalization.
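The abstract does not spell out implementation details, but the SVM confidence-margin idea can be sketched roughly as follows. In this illustrative Python snippet, the helper `categorize_by_margin`, the threshold `critical_band`, and the synthetic dataset are assumptions for demonstration, not taken from the paper: examples the SVM confidently misclassifies are flagged as noisy, examples near the decision boundary as critical, and the rest as typical.

```python
# A minimal sketch of margin-based data categorization, assuming a linear SVM
# as the underlying learner. The threshold `critical_band` and the function
# name are hypothetical choices, not specified by the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def categorize_by_margin(clf, X, y, critical_band=1.0):
    """Bucket training examples by signed SVM confidence margin.

    With labels y in {-1, +1}, signed margin = y * decision_function(x):
      margin < 0          -> "noisy"    (confidently misclassified)
      0 <= margin < band  -> "critical" (near the decision boundary)
      margin >= band      -> "typical"  (far on the correct side)
    """
    signed = y * clf.decision_function(X)
    return np.where(signed < 0, "noisy",
                    np.where(signed < critical_band, "critical", "typical"))

# Synthetic data with 5% label noise so that all three categories appear.
X, y01 = make_classification(n_samples=300, n_features=5, flip_y=0.05,
                             random_state=0)
y = 2 * y01 - 1                      # map {0, 1} labels to {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
cats = categorize_by_margin(clf, X, y)
for c in ("typical", "critical", "noisy"):
    print(c, int((cats == c).sum()))
```

In this spirit, a learner might then down-weight or discard the noisy group and emphasize the critical group when retraining, which is the kind of differential treatment the abstract credits with improved generalization.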