A New Framework for Uncertainty Sampling: Exploiting Uncertain and Positive-Certain Examples in Similarity-Based Text Classification

  • Authors:
  • Kang H. Lee; Byeong H. Kang

  • Venue:
  • ITCC '04: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Volume 2
  • Year:
  • 2004

Abstract

One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling such a large number of examples poses a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using a large amount of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a new framework that automatically selects informative uncertain data to be presented to a human expert for labeling and positive-certain data to be used directly for learning without human labeling. Using our similarity-based learning algorithm (KAN), we conducted experiments on the Reuters-21578 data set. Our proposed scheme was compared with random sampling and conventional uncertainty sampling, based on micro- and macro-averaged F1. The results showed that when both macro- and micro-averaged measures are of concern, our framework might be the optimal choice.
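
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' RinSCut or KAN implementation, whose details the abstract does not give. It assumes, purely for illustration, that each unlabeled document has a similarity score for one category and that two hypothetical thresholds, `lower` and `upper`, bound an ambiguous zone: scores inside the zone are treated as uncertain, and scores above it as positive-certain. The function name, the thresholds, and the scores are all made up for this example.

```python
from typing import Dict, List, Tuple


def partition_by_score(
    scores: Dict[str, float], lower: float, upper: float
) -> Tuple[List[str], List[str], List[str]]:
    """Split unlabeled documents into (uncertain, positive_certain, rest).

    Assumption (not from the paper): `lower` and `upper` are hypothetical
    thresholds on a per-category similarity score.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    uncertain: List[str] = []
    positive_certain: List[str] = []
    rest: List[str] = []

    for doc_id, score in scores.items():
        if score > upper:
            # Confidently positive: candidate for learning without a human label.
            positive_certain.append(doc_id)
        elif score >= lower:
            # Inside the ambiguous zone: candidate for human labeling.
            uncertain.append(doc_id)
        else:
            # Below the zone: neither queried nor auto-labeled in this sketch.
            rest.append(doc_id)

    # Order queries so that scores closest to the middle of the zone come first.
    midpoint = (lower + upper) / 2.0
    uncertain.sort(key=lambda d: abs(scores[d] - midpoint))
    return uncertain, positive_certain, rest


# Hypothetical similarity scores for one category.
example_scores = {"d1": 0.92, "d2": 0.55, "d3": 0.10, "d4": 0.48, "d5": 0.70}
to_label, auto_positive, ignored = partition_by_score(example_scores, 0.4, 0.6)
```

In a loop of this kind, the documents in `to_label` would be shown to the human expert, while those in `auto_positive` would be added to the training set as positive examples without human labeling. How the thresholds are actually derived is specific to RinSCut and is not reproduced here.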