On the utility of partially labeled data for classification of microarray data

Authors:
Ludwig Lausser;Florian Schmid;Hans A. Kestler
Affiliations:
Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany;Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany;Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany
Venue:
PSL'11 Proceedings of the First IAPR TC3 conference on Partially Supervised Learning
Year:
2011

Abstract

Microarrays are standard tools for measuring thousands of gene expression levels simultaneously. They are frequently used to classify tumor tissues. In this setting, a collected set of samples often consists of only a few dozen data points. Common approaches for classifying such data are supervised: they use only labeled data to train a classification model. Restricted to a small number of samples, these algorithms are prone to overfitting and often generalize poorly. An implicit assumption of supervised methods is that only labeled training samples exist. This assumption does not always hold. In medical studies, additional unlabeled samples are often available that cannot be categorized for some time (e.g., "early relapse" vs. "late relapse"). Alternative classification approaches, such as semi-supervised or transductive algorithms, can utilize such partially labeled data. Here, we empirically investigate five semi-supervised and transductive algorithms as "early prediction tools" for incompletely labeled datasets of high dimensionality and low cardinality. Our experimental setup consists of cross-validation experiments under varying ratios of labeled to unlabeled examples. Most interestingly, the best cross-validation performance is not always achieved for completely labeled data but rather for partially labeled datasets, indicating the strong influence of label information on the classification process, even in the linearly separable case.
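The experimental setup described in the abstract, cross-validation under varying ratios of labeled to unlabeled examples, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the five algorithms, so a simple nearest-centroid self-training learner stands in for them; the data are synthetic Gaussian samples rather than microarray measurements; and the helper names (`make_toy_data`, `mask_labels`, `fit_self_training`, `cross_validate`) are hypothetical.

```python
import random


def make_toy_data(n_per_class=20, dim=50, shift=2.0, seed=1):
    """Synthetic stand-in for a microarray set: few samples, many features."""
    rng = random.Random(seed)
    X, labels = [], []
    for label, mu in (("early", 0.0), ("late", shift)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mu, 1.0) for _ in range(dim)])
            labels.append(label)
    return X, labels


def mask_labels(labels, labeled_fraction, rng):
    """Replace labels by None so roughly `labeled_fraction` stay visible.

    At least one label per class is always kept.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    keep = set()
    for idx in by_class.values():
        n_keep = max(1, round(labeled_fraction * len(idx)))
        keep.update(rng.sample(idx, n_keep))
    return [y if i in keep else None for i, y in enumerate(labels)]


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: sq_dist(centroids[c], x))


def fit_self_training(X, partial_y, rounds=10):
    """Assumed stand-in learner: nearest-centroid self-training.

    Unlabeled points (None) receive pseudo-labels from the current
    centroids, which are then refit until the assignment is stable.
    """
    y = list(partial_y)
    cents = {}
    for _ in range(rounds):
        classes = sorted({c for c in y if c is not None})
        cents = {c: centroid([x for x, l in zip(X, y) if l == c])
                 for c in classes}
        new_y = [orig if orig is not None else nearest(cents, x)
                 for x, orig in zip(X, partial_y)]
        if new_y == y:
            break
        y = new_y
    return cents


def cross_validate(X, labels, labeled_fraction, folds=5, seed=0):
    """k-fold accuracy when only part of each training fold is labeled."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    correct = 0
    for f in range(folds):
        test = set(order[f::folds])
        train = [i for i in order if i not in test]
        partial = mask_labels([labels[i] for i in train], labeled_fraction, rng)
        cents = fit_self_training([X[i] for i in train], partial)
        correct += sum(nearest(cents, X[i]) == labels[i] for i in test)
    return correct / len(X)


if __name__ == "__main__":
    X, labels = make_toy_data()
    for fraction in (0.1, 0.25, 0.5, 1.0):
        print(fraction, cross_validate(X, labels, fraction))
```

Running the script prints one cross-validation accuracy per labeled fraction. Hiding labels only inside the training folds is one possible protocol; a transductive variant would also pass the test samples to the learner as unlabeled points.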