Estimating accuracy for text classification tasks on large unlabeled data

Authors:
Snigdha Chaturvedi;Tanveer A. Faruquie;L. Venkata Subramaniam;Mukesh K. Mohania
Affiliations:
IBM Research India, New Delhi, India;IBM Research India, New Delhi, India;IBM Research India, New Delhi, India;IBM India Software Lab, New Delhi, India
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

Rule-based systems for processing text data encode the knowledge of a human expert into a rule base and make decisions based on the interaction of the input data with that rule base. Similarly, supervised learning systems learn patterns present in a given dataset to make decisions on similar and related data. The performance of both classes of models depends largely on the training examples they have seen, on which the learning was based. Even though trained models might fit the training data well, the accuracy they yield on new test data may be considerably different. Computing the accuracy of learnt models on new unlabeled datasets is a challenging problem: it requires costly labeling, which is still likely to cover only a subset of the new data because of the large size of the datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks, document classification and postal address segmentation, using both supervised machine learning methods and human-generated rule-based models.