Semi-automatic document classification: exploiting document difficulty

Authors:
Miguel Martinez-Alvarez; Sirvan Yahyaei; Thomas Roelleke
Affiliations:
Queen Mary, University of London, UK and Globe Business Publishing Ltd., UK; Queen Mary, University of London, UK; Queen Mary, University of London, UK
Venue:
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
Year:
2012

Abstract

In some circumstances, classification is required only if a certain condition, such as a specific level of quality, is met. This paper investigates a semi-automatic solution in which only the predictions for the documents that are most likely to be correctly classified are considered. This method provides high-quality automatic classification for large subsets of the collection and employs human expertise for the "most complicated" decisions. The paper presents different approaches to measuring document difficulty and discusses the benefits of applying them to semi-automatic classification. In addition, experiments are carried out to show the results achieved for different subsets of the collection. The experiments show that quality can be improved significantly on large subsets of two different collections (e.g. a 13% micro-F1 increase with 70% of the documents). Furthermore, the approach provides a flexible mechanism for applying automatic classification to specific subsets as long as specific constraints are met.
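
The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say which difficulty measures are used, so the sketch assumes a hypothetical one (the margin between the two highest class scores, where a small margin means a difficult document). The function name `split_by_difficulty` and the `coverage` parameter are likewise illustrative assumptions.

```python
def split_by_difficulty(scores, coverage):
    """Route documents to automatic or manual classification.

    scores:   one dict per document, mapping class label -> classifier score.
    coverage: fraction of documents to classify automatically (0.0 to 1.0).

    ASSUMPTION: difficulty is approximated by the margin between the two
    highest class scores. This is a stand-in measure, not one taken from
    the paper.
    """
    def margin(doc_scores):
        ranked = sorted(doc_scores.values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

    # Easiest documents (largest margin) come first.
    order = sorted(range(len(scores)),
                   key=lambda i: margin(scores[i]),
                   reverse=True)

    # Number of documents handled automatically, truncated to an integer.
    n_auto = int(coverage * len(scores))

    # Automatic part: document index -> top-scoring class label.
    auto = {i: max(scores[i], key=scores[i].get) for i in order[:n_auto]}
    # Manual part: indices of the documents left for human experts.
    manual = sorted(order[n_auto:])
    return auto, manual


if __name__ == "__main__":
    example = [
        {"sports": 0.9, "politics": 0.1},    # clear-cut
        {"sports": 0.55, "politics": 0.45},  # borderline
        {"sports": 0.2, "politics": 0.8},    # clear-cut
        {"sports": 0.5, "politics": 0.5},    # tie
    ]
    auto, manual = split_by_difficulty(example, coverage=0.5)
    print(auto)    # predictions kept for the easiest half
    print(manual)  # documents sent to human experts
```

Raising `coverage` classifies more documents automatically at the cost of including harder ones, which mirrors the trade-off between subset size and quality that the abstract reports.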