A utility-theoretic ranking method for semi-automated text classification

Authors:
Giacomo Berardi;Andrea Esuli;Fabrizio Sebastiani
Affiliations:
ISTI-CNR, Pisa, Italy;ISTI-CNR, Pisa, Italy;ISTI-CNR, Pisa, Italy
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Citing 20
Cited 0

Combining classifiers in text categorization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
BoosTexter: A Boosting-based Systemfor Text Categorization

Machine Learning - Special issue on information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Employing EM and Pool-Based Active Learning for Text Classification

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Support vector machine active learning with applications to text classification

The Journal of Machine Learning Research
An empirical study of smoothing techniques for language modeling

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Editorial: special issue on learning from imbalanced data sets

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Document classification through interactive supervision of document and term labels

PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
Large-scale text categorization by batch mode active learning

Proceedings of the 15th international conference on World Wide Web
Correcting category errors in text classification

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Active Learning with Feedback on Features and Instances

The Journal of Machine Learning Research
A new interpretation of average precision

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Rank-biased precision for measurement of retrieval effectiveness

ACM Transactions on Information Systems (TOIS)
Active Learning Strategies for Multi-Label Text Classification

ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Training Data Cleaning for Text Classification

ICTIR '09 Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory
The foundations of cost-sensitive learning

IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2
Evaluation of information retrieval for E-discovery

Artificial Intelligence and Law
MP-Boost: a multiple-pivot boosting algorithm and its application to text categorization

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

In Semi-Automated Text Classification (SATC) an automatic classifier F labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by F to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive by inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.