Exploiting Unlabeled Data for Improving Accuracy of Predictive Data Mining

  • Authors:
  • Kang Peng;Slobodan Vucetic;Bo Han;Hongbo Xie;Zoran Obradovic

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

Predictive data mining typically relies on labeled data without exploiting a much larger amount of available unlabeled data. The goal of this paper is to show that using unlabeled data can be beneficial in a range of important prediction problems and therefore should be an integral part of the learning process. Given an unlabeled dataset representative of the underlying distribution and a K-class labeled sample that might be biased, our approach is to learn K contrast classifiers, each trained to discriminate a certain class of labeled data from the unlabeled population. We illustrate that contrast classifiers can be useful in one-class classification, outlier detection, density estimation, and learning from biased data. The advantages of the proposed approach are demonstrated by an extensive evaluation on synthetic data followed by real-life bioinformatics applications for (1) ranking PubMed articles by their relevance to protein disorder and (2) cost-effective enlargement of a disordered protein database.
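
The contrast-classifier idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify a model, so the 1-D logistic regression, the synthetic Gaussian data, and all names below (`train_contrast_classifier`, `contrast`, the class labels "a" and "b") are assumptions made for illustration. For each class, the labeled examples of that class are given target 1 and the unlabeled pool target 0; the fitted score is then expected to be higher where that class is over-represented relative to the unlabeled population.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_contrast_classifier(labeled, unlabeled, epochs=200, lr=0.1):
    """Fit a 1-D logistic regression: labeled -> 1, unlabeled -> 0.

    Hypothetical helper; the model choice is an assumption, not the paper's.
    """
    data = [(x, 1.0) for x in labeled] + [(x, 0.0) for x in unlabeled]
    n = len(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y  # gradient of the log loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return lambda x: sigmoid(w * x + b)


rng = random.Random(0)

# Synthetic unlabeled pool: an equal mixture of two Gaussian classes.
unlabeled = [rng.gauss(-2.0, 1.0) for _ in range(200)] + \
            [rng.gauss(2.0, 1.0) for _ in range(200)]

# Synthetic K = 2 labeled sample, one list of points per class.
labeled = {
    "a": [rng.gauss(-2.0, 1.0) for _ in range(200)],
    "b": [rng.gauss(2.0, 1.0) for _ in range(200)],
}

# One contrast classifier per class, each trained against the unlabeled pool.
contrast = {k: train_contrast_classifier(xs, unlabeled)
            for k, xs in labeled.items()}

for k in sorted(contrast):
    print(k, round(contrast[k](-2.0), 3), round(contrast[k](2.0), 3))
```

In this sketch each score behaves like a monotone function of the ratio between the class's labeled density and the unlabeled density, which is one way to read the abstract's mention of density estimation and outlier detection; how the paper actually combines the K outputs is not stated in the abstract.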