A case-study on naïve labelling for the nearest mean and the linear discriminant classifiers

  • Authors:
  • L. I. Kuncheva; C. J. Whitaker; A. Narasimhamurthy

  • Affiliations:
  • School of Computer Science, Bangor University, Bangor LL57 1UT, UK; School of Psychology, Bangor University, Bangor, UK; School of Computer Science and Informatics, University College Dublin (UCD), Dublin, Ireland

  • Venue:
  • Pattern Recognition
  • Year:
  • 2008

Abstract

The abundance of unlabelled data alongside limited labelled data has provoked significant interest in semi-supervised learning methods. "Naive labelling" refers to the following simple strategy for using unlabelled data in on-line classification: a new data point is first labelled by the current classifier and then added to the training set together with the assigned label; the classifier is updated before seeing the subsequent data point. Although the danger of a run-away classifier is obvious, versions of naive labelling pervade on-line adaptive learning. We study the asymptotic behaviour of naive labelling in the case of two Gaussian classes and one variable. The analysis shows that if the classifier model correctly assumes the underlying distribution of the problem, naive labelling will drive the parameters of the classifier towards their optimal values. If the model is not guessed correctly, however, the benefits are outweighed by the instability of the labelling strategy (run-away behaviour of the classifier). The results are based on exact calculations of the point of convergence, simulations, and experiments with 25 real data sets. Our findings are consistent with concerns about the general use of unlabelled data flagged up in the recent literature.
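The naive-labelling loop described above can be sketched for the nearest mean classifier on the paper's one-variable, two-Gaussian setting. This is a minimal illustration under assumed parameters (unit-variance classes at -1 and +1, five labelled seeds per class), not the authors' experimental code: each unlabelled point is labelled by the current classifier and immediately folded into that class's running mean.

```python
import random

random.seed(0)

# Assumed illustration: two unit-variance Gaussian classes, one variable.
MU0, MU1 = -1.0, 1.0

def sample(label):
    """Draw one point from the true distribution of the given class."""
    return random.gauss(MU1 if label else MU0, 1.0)

# Start from a small labelled seed set: five points per class.
means = [sum(sample(0) for _ in range(5)) / 5,
         sum(sample(1) for _ in range(5)) / 5]
counts = [5, 5]

# Naive labelling: the new point gets the label of the nearest class mean,
# then updates that mean before the next point arrives.
for _ in range(10000):
    x = sample(random.randint(0, 1))          # unlabelled stream
    y_hat = 0 if abs(x - means[0]) <= abs(x - means[1]) else 1
    counts[y_hat] += 1
    means[y_hat] += (x - means[y_hat]) / counts[y_hat]  # incremental mean

print(means)
```

Because the nearest mean model matches the equal-variance Gaussian setting here, the two estimated means stay separated on either side of the true decision boundary; under a mismatched model the same loop can exhibit the run-away behaviour the paper analyses.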