A New Framework for Uncertainty Sampling: Exploiting Uncertain and Positive-Certain Examples in Similarity-Based Text Classification

Authors:
Kang H. Lee;Byeong H. Kang
Affiliations:
-;-
Venue:
ITCC '04 Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04) Volume 2
Year:
2004

Citing 0
Cited 1

A new model for classifying DNA code inspired by neural networks and FSA

PKAW'06 Proceedings of the 9th Pacific Rim Knowledge Acquisition international conference on Advances in Knowledge Acquisition and Management

Abstract

One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling such a large number of examples poses a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using large amounts of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one approach, we investigate a framework that combines both approaches in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a new framework that automatically selects informative uncertain data to be presented to a human expert for labeling and positive-certain data that are used directly for learning without human labeling. With our similarity-based learning algorithm (KAN), experiments have been conducted on the Reuters-21578 data set. Our proposed scheme has been compared with random sampling and conventional uncertainty sampling, based on micro- and macro-averaged F1. The results showed that if both macro- and micro-averaged measures are of concern, the optimal choice might be our framework.
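
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' RinSCut or KAN implementation: the function name, the two cut-off values (`lower`, `upper`), and the example scores are all hypothetical. It assumes only that each unlabeled document already has a similarity score for a category, and that scores near the decision threshold count as uncertain while scores well above it count as positive-certain.

```python
from typing import Dict, List, Tuple


def partition_by_similarity(
    scores: Dict[str, float],
    lower: float,
    upper: float,
) -> Tuple[List[str], List[str], List[str]]:
    """Split unlabeled documents into three groups by similarity score.

    Hypothetical rule (an assumption, not taken from the paper):
      score >= upper          -> positive-certain (used without human labeling)
      lower <= score < upper  -> uncertain (sent to a human expert)
      score < lower           -> left in the unlabeled pool
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    positive_certain: List[str] = []
    uncertain: List[str] = []
    remaining: List[str] = []

    # Iterate in sorted key order so the output is deterministic.
    for doc_id, score in sorted(scores.items()):
        if score >= upper:
            positive_certain.append(doc_id)
        elif score >= lower:
            uncertain.append(doc_id)
        else:
            remaining.append(doc_id)

    return positive_certain, uncertain, remaining


# Made-up similarity scores for four unlabeled documents.
example_scores = {"d1": 0.91, "d2": 0.55, "d3": 0.12, "d4": 0.48}
positive_certain, uncertain, remaining = partition_by_similarity(
    example_scores, lower=0.4, upper=0.8
)
```

In this sketch only the documents in `uncertain` would cost human labeling effort, while those in `positive_certain` would be added to the training set as positive examples automatically.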