A model for handling approximate, noisy or incomplete labeling in text classification

Authors:
Ganesh Ramakrishnan;Krishna Prasad Chitrapura;Raghu Krishnapuram;Pushpak Bhattacharyya
Affiliations:
IBM India Research Lab, IIT, New Delhi, India;IBM India Research Lab, IIT, New Delhi, India;IBM India Research Lab, IIT, New Delhi, India;IIT Bombay, Mumbai, India
Venue:
ICML '05 Proceedings of the 22nd international conference on Machine learning
Year:
2005

Abstract

We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvements in the classification accuracy of two popular classification algorithms on standard labeled data sets, with and without artificially introduced noise and in the presence and absence of unlabeled data, indicate that this may be a promising method to reduce the burden of manual labeling.
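
The abstract does not give BayesANIL's update equations, so the following is only a minimal sketch of the general idea it describes: running EM over documents whose labels may be missing or wrong, so that each document ends up with a posterior over its true class rather than a hard label. It is not the authors' model. It assumes a multinomial naive Bayes document model and a symmetric label-noise rate; the function name `em_noisy_label_nb`, the `noise` and `alpha` parameters, and the toy corpus are all hypothetical choices made for illustration.

```python
import math


def em_noisy_label_nb(docs, labels, n_classes, noise=0.2, n_iter=20, alpha=1.0):
    """Soft-label EM for a multinomial naive Bayes model (illustrative only).

    docs:   list of token lists
    labels: list of observed class ids, or None for unlabeled documents
    noise:  ASSUMED probability that an observed label is wrong (0 < noise < 1)
    alpha:  Laplace smoothing constant for priors and word probabilities
    Returns one posterior distribution over the true class per document.
    """
    if not 0.0 < noise < 1.0 or n_classes < 2:
        raise ValueError("need 0 < noise < 1 and at least two classes")
    vocab = sorted({w for doc in docs for w in doc})
    classes = range(n_classes)

    def label_likelihood(y):
        # p(observed label | true class) under a symmetric noise assumption;
        # an unlabeled document carries no label evidence (constant factor).
        if y is None:
            return [1.0 / n_classes] * n_classes
        off = noise / (n_classes - 1)
        return [1.0 - noise if c == y else off for c in classes]

    lab = [label_likelihood(y) for y in labels]
    post = [list(p) for p in lab]  # initial soft assignments

    for _ in range(n_iter):
        # M-step: re-estimate class priors and word probabilities from soft counts.
        mass = [sum(p[c] for p in post) + alpha for c in classes]
        log_prior = [math.log(m / sum(mass)) for m in mass]
        counts = [{w: alpha for w in vocab} for _ in classes]
        for doc, p in zip(docs, post):
            for w in doc:
                for c in classes:
                    counts[c][w] += p[c]
        log_word = []
        for c in classes:
            total = sum(counts[c].values())
            log_word.append({w: math.log(n / total) for w, n in counts[c].items()})

        # E-step: posterior over the true class from text plus observed label.
        for i, doc in enumerate(docs):
            score = [log_prior[c] + math.log(lab[i][c])
                     + sum(log_word[c][w] for w in doc) for c in classes]
            top = max(score)  # subtract the max for numerical stability
            unnorm = [math.exp(s - top) for s in score]
            z = sum(unnorm)
            post[i] = [u / z for u in unnorm]
    return post


# Toy corpus: class 0 is sports-like text, class 1 is cooking-like text.
docs = [
    ["ball", "goal", "team"], ["goal", "team", "match"], ["ball", "match", "team"],
    ["oven", "salt", "recipe"], ["recipe", "oven", "bake"], ["salt", "bake", "oven"],
    ["ball", "goal", "match"],   # deliberately mislabeled as class 1
    ["recipe", "salt", "bake"],  # unlabeled
]
labels = [0, 0, 0, 1, 1, 1, 1, None]

for y, p in zip(labels, em_noisy_label_nb(docs, labels, n_classes=2)):
    print(y, [round(x, 3) for x in p])
```

The returned posteriors could then be passed as soft targets or instance weights to a standard classifier, which is one way to read the abstract's remark that the estimates "can be used in standard classification models". The empirical-distribution re-estimation step mentioned in the abstract is not reproduced here.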