Reduction of training noises for text classifiers

Authors:
Rey-Long Liu
Affiliations:
Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
Venue:
ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part II
Year:
2013

Abstract

Automatic text classification (TC) is essential for archiving and retrieving texts, which are a main means of recording information and expertise. Many text classifiers have therefore been developed. They are typically built from training texts and have shown good performance in various application domains. In practice, however, training texts are often unsound or incomplete, so they contain many terms that are unrelated to the categories of interest. Such terms act as noise in classifier training and can degrade classifier performance. Reducing this training noise is thus essential, but it is also challenging precisely because the training texts are unsound or incomplete. In this paper, we develop a technique, TNR (Training Noise Reduction), that removes likely training noise so that classifier performance can be further improved. Given a training text d of a category c, TNR identifies a sequence of consecutive terms in d as noise if the terms are not strongly related to c. A case study on classifying Chinese texts of disease information shows that TNR can improve a Support Vector Machine (SVM) classifier, a state-of-the-art classifier in TC. The contribution is significant for the further enhancement of existing text classifiers.
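
The idea of treating a run of consecutive, weakly related terms in a training text as noise can be illustrated with a minimal sketch. This is not the TNR algorithm itself: the abstract does not say how term-to-category relatedness is measured or how noisy sequences are delimited. The function name `remove_noise_runs`, the per-term `relatedness` scores, the `threshold`, and the minimum run length `min_run` are all assumptions made for illustration only.

```python
from typing import Dict, List


def remove_noise_runs(
    terms: List[str],
    relatedness: Dict[str, float],
    threshold: float = 0.5,
    min_run: int = 3,
) -> List[str]:
    """Drop runs of at least `min_run` consecutive weakly related terms.

    `relatedness` maps a term to a hypothetical score for category c;
    terms missing from the map are treated as unrelated (score 0.0).
    """
    kept: List[str] = []
    run: List[str] = []  # current run of weakly related terms
    for term in terms:
        if relatedness.get(term, 0.0) < threshold:
            run.append(term)
            continue
        # A strongly related term ends the run; short runs are kept.
        if len(run) < min_run:
            kept.extend(run)
        run = []
        kept.append(term)
    # Handle a run that reaches the end of the text.
    if len(run) < min_run:
        kept.extend(run)
    return kept


# Toy scores for a hypothetical disease category (made-up values).
scores = {"fever": 0.9, "cough": 0.8, "vaccine": 0.7}
doc = ["fever", "and", "cough", "visit", "our", "website", "today", "vaccine"]
cleaned = remove_noise_runs(doc, scores)
print(cleaned)
```

In this sketch, isolated weakly related terms such as "and" are kept, and only longer consecutive runs are removed. The cleaned term sequence would then be passed to the classifier's training step in place of the original text.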