Goal-oriented methods and meta methods for document classification and their parameter tuning

Authors:
Stefan Siersdorfer;Sergej Sizov;Gerhard Weikum
Affiliations:
Max-Planck-Institut für Informatik, Saarbruecken, Germany;Max-Planck-Institut für Informatik, Saarbruecken, Germany;Max-Planck-Institut für Informatik, Saarbruecken, Germany
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Abstract

Automatic text classification methods come with various calibration parameters such as thresholds for probabilities in Bayesian classifiers or for hyperplane distances in SVM classifiers. In a given application context these parameters should be set so as to meet the relative importance of various result quality metrics such as precision versus recall. In this paper we consider classifiers that can accept a document for a topic, reject it, or abstain. We aim to meet the application's goals in terms of accuracy (i.e., avoid false acceptances or rejections) and loss (i.e., limit the fraction of documents for which no decision is made). To this end we investigate restrictive forms of Support Vector Machine classifiers and we develop meta methods that split the training data into subsets for independently trained classifiers and then combine the results of these classifiers. These techniques tend to improve accuracy at the expense of document loss. We develop estimators that help to predict the accuracy and loss for a given setting of the methods' tuning parameters, and a methodology for efficiently deriving a setting that meets the application's goals. Our experiments confirm the practical viability of the approach.
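
The accept/reject/abstain setting described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's own formulation: the names `restrictive_decision`, `unanimous_meta_decision`, and `accuracy_and_loss` are hypothetical, the restrictive classifier is modeled as a symmetric threshold on a signed score (such as an SVM hyperplane distance), the meta method is modeled as a unanimity vote, accuracy is measured over decided documents only, and loss is the fraction of documents left undecided.

```python
from typing import Sequence, Tuple

# Three possible outcomes for a document and a topic (encoding is an assumption).
ACCEPT, REJECT, ABSTAIN = 1, -1, 0


def restrictive_decision(score: float, threshold: float) -> int:
    """Decide only when the signed score is at least `threshold` away from zero.

    `score` stands in for a signed classifier output such as an SVM
    hyperplane distance; `threshold` is the (hypothetical) tuning parameter.
    """
    if score >= threshold:
        return ACCEPT
    if score <= -threshold:
        return REJECT
    return ABSTAIN


def unanimous_meta_decision(votes: Sequence[int]) -> int:
    """Combine votes of independently trained classifiers by unanimity.

    The meta classifier decides only if every base classifier casts the
    same non-abstaining vote; otherwise it abstains.
    """
    if votes and all(v == ACCEPT for v in votes):
        return ACCEPT
    if votes and all(v == REJECT for v in votes):
        return REJECT
    return ABSTAIN


def accuracy_and_loss(decisions: Sequence[int],
                      labels: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy over decided documents, fraction of abstentions)."""
    decided = [(d, y) for d, y in zip(decisions, labels) if d != ABSTAIN]
    loss = (len(decisions) - len(decided)) / len(decisions)
    if not decided:
        return float("nan"), loss
    accuracy = sum(d == y for d, y in decided) / len(decided)
    return accuracy, loss


if __name__ == "__main__":
    # Made-up scores and true labels (+1 = on topic, -1 = off topic).
    scores = [2.1, 0.3, -1.7, -0.2, 0.9]
    labels = [1, -1, -1, 1, 1]
    for t in (0.0, 0.5):
        decisions = [restrictive_decision(s, t) for s in scores]
        print(t, accuracy_and_loss(decisions, labels))
```

On this made-up data, raising the threshold from 0.0 to 0.5 makes the classifier abstain on the two low-confidence documents, so accuracy on the remaining documents rises while loss grows. This is the trade-off the abstract refers to; the paper's estimators for predicting it are not reproduced here.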