Preventing Overfitting in Learning Text Patterns for Document Categorization

  • Authors:
  • Markus Junker; Andreas Dengel

  • Venue:
  • ICAPR '01 Proceedings of the Second International Conference on Advances in Pattern Recognition
  • Year:
  • 2001

Abstract

There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially "as they are". Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic, combined with the likelihood-ratio statistic for pruning. Even better results can be obtained by replacing the likelihood-ratio statistic with a new pruning measure, which we call the l-measure. In contrast to conventional measures for pruning, the l-measure takes into account properties of the search space.
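
For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of how the m-estimate and the likelihood-ratio statistic are commonly defined in the symbolic rule-learning literature (for example, in CN2-style learners). The function names, the argument names (`p`, `n`, `P`, `N`), and the default `m=2.0` are assumptions made for illustration; the abstract does not give the exact formulation or parameter settings used in the paper. The l-measure is not sketched because the abstract does not define it.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """m-estimate of a pattern's precision (textbook form, assumed here).

    p, n: positive / negative documents matched by the pattern.
    P, N: positive / negative documents in the whole training set.
    m:    weight given to the prior; the default is an arbitrary choice.
    """
    prior = P / (P + N)  # prior probability of the positive class
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood-ratio statistic: 2 * sum(observed * ln(observed / expected)).

    Compares the class distribution among matched documents with the
    distribution expected if the pattern matched documents at random.
    """
    covered = p + n
    if covered == 0:
        return 0.0
    expected_pos = covered * P / (P + N)
    expected_neg = covered * N / (P + N)
    total = 0.0
    for observed, expected in ((p, expected_pos), (n, expected_neg)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2.0 * total


if __name__ == "__main__":
    # Hypothetical counts: a pattern matching 8 positive and 2 negative
    # documents in a training set of 50 positives and 50 negatives.
    print(m_estimate(8, 2, 50, 50))
    print(likelihood_ratio_statistic(8, 2, 50, 50))
```

In rule learners of this kind, the m-estimate is typically used to rank candidate patterns during search, while the likelihood-ratio statistic is compared against a chi-square threshold so that patterns whose class distribution could plausibly arise by chance are pruned.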