Evaluation and Construction of Training Corpuses for Text Classification: A Preliminary Study

Authors:
Shuigeng Zhou;Jihong Guan
Affiliations:
-;-
Venue:
NLDB '02 Proceedings of the 6th International Conference on Applications of Natural Language to Information Systems-Revised Papers
Year:
2002

Citing 9
Cited 0

Classifying news stories using memory based reasoning
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval

An example-based mapping method for text categorization and retrieval
ACM Transactions on Information Systems (TOIS)

Towards language independent automated learning of text categorization models
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval

Improved boosting algorithms using confidence-rated predictions
COLT '98 Proceedings of the eleventh annual conference on Computational learning theory

A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning

Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure
WAIM '00 Proceedings of the First International Conference on Web-Age Information Management

Chinese Documents Classification Based on N-Grams
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

Text classification is becoming increasingly important with the rapid growth of available online information. The quality of the training corpus has been observed to affect the performance of the trained classifier. This paper proposes an approach to building high-quality training corpuses for better classification performance: it first explores the properties of training corpuses and then gives an algorithm for constructing them semi-automatically. Preliminary experimental results validate our approach: classifiers trained on corpuses constructed by our approach achieve good performance while the size of the training corpus is significantly reduced. Our approach can be used to build efficient, lightweight classification systems.