A model for handling approximate, noisy or incomplete labeling in text classification

  • Authors:
  • Ganesh Ramakrishnan;Krishna Prasad Chitrapura;Raghu Krishnapuram;Pushpak Bhattacharyya

  • Affiliations:
  • IBM India Research Lab, IIT, New Delhi, India;IBM India Research Lab, IIT, New Delhi, India;IBM India Research Lab, IIT, New Delhi, India;IIT Bombay, Mumbai, India

  • Venue:
  • ICML '05 Proceedings of the 22nd international conference on Machine learning
  • Year:
  • 2005

Abstract

We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvement in the classification accuracies of two popular classification algorithms on standard labeled data sets, with and without artificially introduced noise and in the presence and absence of unlabeled data, indicates that this may be a promising method to reduce the burden of manual labeling.
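
The abstract does not give the BayesANIL update equations, so the following is not the authors' algorithm. It is a minimal sketch of a related, standard technique: EM for a multinomial naive Bayes model in which each observed label is treated as a noisy observation of the true class and unlabeled documents carry no label evidence. The function name `em_soft_labels`, the `noise` rate, and the smoothing constant `alpha` are all assumptions introduced for illustration.

```python
import math

def em_soft_labels(docs, labels, n_classes, noise=0.1, n_iter=20, alpha=1.0):
    """Illustrative EM with soft class assignments (not BayesANIL itself).

    docs:   list of token lists
    labels: list of int class ids, or None for unlabeled documents
    noise:  assumed probability that an observed label is wrong;
            must lie strictly between 0 and 1, with n_classes >= 2
    Returns one list of class posteriors per document.
    """
    vocab = sorted({w for d in docs for w in d})

    def label_lik(y, c):
        # P(observed label y | true class c); unlabeled docs give no evidence.
        if y is None:
            return 1.0
        return 1.0 - noise if y == c else noise / (n_classes - 1)

    # Initialise the posteriors from the label evidence alone.
    post = []
    for y in labels:
        row = [label_lik(y, c) for c in range(n_classes)]
        z = sum(row)
        post.append([r / z for r in row])

    for _ in range(n_iter):
        # M-step: smoothed class priors and word distributions from soft counts.
        prior = [
            (alpha + sum(p[c] for p in post)) / (alpha * n_classes + len(docs))
            for c in range(n_classes)
        ]
        word_logp = []
        for c in range(n_classes):
            counts = {w: alpha for w in vocab}
            for d, p in zip(docs, post):
                for w in d:
                    counts[w] += p[c]
            total = sum(counts.values())
            word_logp.append({w: math.log(v / total) for w, v in counts.items()})

        # E-step: recompute P(class | words, observed label) in log space.
        new_post = []
        for d, y in zip(docs, labels):
            logs = [
                math.log(prior[c])
                + math.log(label_lik(y, c))
                + sum(word_logp[c][w] for w in d)
                for c in range(n_classes)
            ]
            m = max(logs)
            exps = [math.exp(v - m) for v in logs]
            z = sum(exps)
            new_post.append([e / z for e in exps])
        post = new_post

    return post
```

The returned posteriors can serve as per-document class weights when training a downstream classifier, which is one plausible reading of "the estimates can be used in standard classification models". A labeled document whose words disagree with its label has its posterior pulled away from that label, so it contributes less to that class.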