Unlabeling data can improve classification accuracy

  • Authors:
  • Ludwig Lausser; Florian Schmid; Matthias Schmid; Hans A. Kestler

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2014

Abstract

In this study we focus on the effects of sample limitations on partially supervised learning algorithms. We analyze the performance of these learning algorithms on small datasets under varying trade-offs between labeled and unlabeled samples. In contrast to the typical setting for partially supervised learning, the number of available unlabeled samples is also restricted. We utilize gene expression datasets, which are typical examples of data collections with small sample sizes. These profiles are generated with DNA microarrays, which measure thousands of mRNA values simultaneously, and are increasingly used for tumor categorization. Partially labeled microarray datasets occur naturally in the diagnostic setting when the corresponding labeling process is time-consuming or expensive (e.g., "early relapse" vs. "late relapse"). Surprisingly, the best classification results in our study were not always achieved with the maximal proportion of labeled samples. This is unexpected, as asymptotic results for an unlimited number of samples suggest that a labeled sample is of exponentially higher value than an unlabeled one. Our analysis shows that for finite sample sizes a more balanced trade-off between labeled and unlabeled samples is optimal. This trade-off was not uniform across all experiments; the optimal trade-off between unlabeled and labeled samples depends mainly on the chosen learning algorithm.
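The experimental protocol described in the abstract — fix the total sample size, vary the labeled/unlabeled split, and measure accuracy — can be sketched with a toy self-training experiment. This is not the authors' algorithm or data; it uses a synthetic one-dimensional feature (a hypothetical stand-in for a single gene-expression value) and a simple nearest-mean classifier refined by pseudo-labeling the unlabeled pool:

```python
import random

random.seed(0)

def nearest_mean(x, m0, m1):
    """Assign x to the class whose mean is closer."""
    return 0 if abs(x - m0) <= abs(x - m1) else 1

def class_means(pairs):
    """Per-class mean feature value from (x, label) pairs."""
    out = []
    for c in (0, 1):
        vals = [x for x, y in pairs if y == c]
        out.append(sum(vals) / len(vals))
    return out[0], out[1]

def self_train(labeled, unlabeled, rounds=5):
    """Nearest-mean classifier, iteratively refined with pseudo-labels."""
    m0, m1 = class_means(labeled)
    for _ in range(rounds):
        pseudo = [(x, nearest_mean(x, m0, m1)) for x in unlabeled]
        m0, m1 = class_means(labeled + pseudo)
    return m0, m1

# Synthetic data: two Gaussian classes, a fixed training pool of 20
# samples per class, and a large held-out test set.
n_per_class = 20
pool = {c: [random.gauss(2.0 * c, 1.0) for _ in range(n_per_class)]
        for c in (0, 1)}
test = [(random.gauss(2.0 * c, 1.0), c) for c in (0, 1) for _ in range(200)]

# Vary the trade-off: k labeled samples per class, the rest unlabeled.
results = {}
for k in (2, 5, 10, 20):
    labeled = [(x, c) for c in (0, 1) for x in pool[c][:k]]
    unlabeled = [x for c in (0, 1) for x in pool[c][k:]]
    m0, m1 = self_train(labeled, unlabeled)
    acc = sum(nearest_mean(x, m0, m1) == y for x, y in test) / len(test)
    results[k] = round(acc, 3)

print(results)  # test accuracy as a function of the labeled/unlabeled split
```

On a given random draw the fully labeled setting need not win — which is the qualitative effect the study reports for finite samples, though the real experiments use microarray data and several partially supervised learners rather than this toy setup.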