Preventing Overfitting in Learning Text Patterns for Document Categorization

Authors:
Markus Junker;Andreas Dengel
Venue:
ICAPR '01 Proceedings of the Second International Conference on Advances in Pattern Recognition
Year:
2001

Abstract

There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially "as they are". Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic combined with the likelihood-ratio statistic for pruning. Even better results can be obtained by replacing the likelihood-ratio statistic with a new pruning measure, which we call the l-measure. In contrast to conventional pruning measures, the l-measure takes properties of the search space into account.
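
For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of the m-estimate and the likelihood-ratio statistic as they are commonly defined in the symbolic rule-learning literature (for example, in CN2-style learners). It is an illustration under assumptions, not the paper's implementation: the function names, the argument names (p, n for covered positive and negative documents; P, N for the totals), and the default m = 2.0 are hypothetical, and the exact variants used in the paper may differ. The l-measure is not sketched because the abstract does not define it.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """m-estimate of a pattern's precision.

    p, n: positive / negative documents matched by the pattern.
    P, N: positive / negative documents in the whole training set.
    m:    weight given to the class prior (assumed default, not from the paper).
    """
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood-ratio statistic: 2 * sum(observed * ln(observed / expected)).

    Compares the class distribution among matched documents with the
    distribution expected if the pattern matched documents at random.
    """
    covered = p + n
    expected_pos = covered * P / (P + N)
    expected_neg = covered * N / (P + N)
    total = 0.0
    for observed, expected in ((p, expected_pos), (n, expected_neg)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2.0 * total


if __name__ == "__main__":
    # Hypothetical counts: a pattern matching 8 positive and 2 negative
    # documents in a training set of 50 positive and 50 negative documents.
    print(m_estimate(8, 2, 50, 50))
    print(likelihood_ratio_statistic(8, 2, 50, 50))
```

In a learner of this kind, the m-estimate would typically rank candidate pattern refinements during search, while the likelihood-ratio statistic would be compared against a significance threshold to decide whether a pattern is kept or pruned.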