On the utility of partially labeled data for classification of microarray data

  • Authors:
  • Ludwig Lausser;Florian Schmid;Hans A. Kestler

  • Affiliations:
  • Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany;Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany;Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm, Germany

  • Venue:
  • PSL'11 Proceedings of the First IAPR TC3 conference on Partially Supervised Learning
  • Year:
  • 2011

Abstract

Microarrays are standard tools for measuring thousands of gene expression levels simultaneously. They are frequently used to classify tumor tissues. In this setting, a collected set of samples often consists of only a few dozen data points. Common approaches for classifying such data are supervised: they use only categorized data to train a classification model. Restricted to a small number of samples, these algorithms are prone to overfitting and often generalize poorly. An implicit assumption of supervised methods is that only labeled training samples exist. This assumption does not always hold. In medical studies, additional unlabeled samples are often available that cannot be categorized for some time (e.g., "early relapse" vs. "late relapse"). Alternative classification approaches, such as semi-supervised or transductive algorithms, can utilize such partially labeled data. Here, we empirically investigate five semi-supervised and transductive algorithms as "early prediction tools" for incompletely labeled datasets of high dimensionality and low cardinality. Our experimental setup consists of cross-validation experiments under varying ratios of labeled to unlabeled examples. Most interestingly, the best cross-validation performance is not always achieved for completely labeled data but rather for partially labeled datasets, indicating the strong influence of label information on the classification process, even in the linearly separable case.
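
The experimental setup described in the abstract (cross-validation under varying ratios of labeled to unlabeled examples) can be illustrated with a minimal sketch. The abstract does not name the five algorithms studied, so the sketch substitutes a simple self-training nearest-centroid classifier as a stand-in. All function names, the synthetic data, the fold count, and the label ratios are assumptions made for illustration only and are not taken from the paper.

```python
import random

def make_toy_data(n_per_class=20, n_features=50, shift=5.0, seed=1):
    """Synthetic two-class data (hypothetical stand-in for microarray profiles)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((0, 0.0), (1, shift)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y

def make_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mask_labels(y, ratio, rng):
    """Hide labels (None), keeping about `ratio` of each class, at least one."""
    y_partial = [None] * len(y)
    for c in sorted(set(y)):
        members = [i for i, label in enumerate(y) if label == c]
        rng.shuffle(members)
        keep = max(1, round(ratio * len(members)))
        for i in members[:keep]:
            y_partial[i] = c
    return y_partial

def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest(x, centroids):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )

def fit_centroids(pairs):
    """One centroid per class from (features, label) pairs."""
    classes = {label for _, label in pairs}
    return {c: centroid([x for x, label in pairs if label == c]) for c in classes}

def self_training_centroids(X, y_partial, rounds=5):
    """Fit on labeled rows, then repeatedly pseudo-label the unlabeled rows
    with the current centroids and refit on labeled plus pseudo-labeled rows."""
    labeled = [(x, label) for x, label in zip(X, y_partial) if label is not None]
    unlabeled = [x for x, label in zip(X, y_partial) if label is None]
    cents = fit_centroids(labeled)
    for _ in range(rounds):
        pseudo = [(x, nearest(x, cents)) for x in unlabeled]
        cents = fit_centroids(labeled + pseudo)
    return cents

def cv_accuracy(X, y, ratio, k=5, seed=0):
    """k-fold cross-validation accuracy when only `ratio` of the training
    labels are visible; test labels are used for scoring only."""
    rng = random.Random(seed)
    hits = 0
    for test_idx in make_folds(len(X), k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = mask_labels([y[i] for i in train_idx], ratio, rng)
        cents = self_training_centroids(X_train, y_train)
        hits += sum(nearest(X[i], cents) == y[i] for i in test_idx)
    return hits / len(X)

if __name__ == "__main__":
    X, y = make_toy_data()
    for ratio in (0.1, 0.25, 0.5, 1.0):
        print(ratio, cv_accuracy(X, y, ratio))
```

In this sketch the unlabeled training samples influence the model only through pseudo-labels, and held-out samples are never used for fitting. A transductive variant would instead add the test samples to the unlabeled pool before fitting.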