Automated retraining methods for document classification and their parameter tuning

Authors:
Stefan Siersdorfer;Gerhard Weikum
Affiliations:
Max-Planck-Institute for Computer Science, Germany;Max-Planck-Institute for Computer Science, Germany
Venue:
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Year:
2005

Citing 16
Cited 0

Evaluating text categorization

HLT '91 Proceedings of the workshop on Speech and Natural Language
A real-time audio frequency cubic spline interpolator

Signal Processing
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Interactive shape preserving interpolation by curvature continuous rational cubic splines

Journal of Computational and Applied Mathematics - Special issue on computational methods in computer graphics
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Semi-supervised support vector machines

Proceedings of the 1998 conference on Advances in neural information processing systems II
Hierarchical classification of Web content

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
Modern Information Retrieval

Modern Information Retrieval
The use of unlabeled data to improve supervised learning for text summarization

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
A Tutorial on Support Vector Machines for Pattern Recognition

Data Mining and Knowledge Discovery
Mining the Web: Discovering Knowledge from HyperText Data

Mining the Web: Discovering Knowledge from HyperText Data
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Exploiting unlabeled data in ensemble methods

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is by itself not robust and faces tuning problems regarding parameters like the number of selected documents, the number of retraining iterations, and the ratio of positive and negative classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a re-trained classifier and avoiding that the classifier is diluted by selecting too many or weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.