Hierarchical classification of web documents by stratified discriminant analysis

  • Authors:
  • Juan Carlos Gomez;Marie-Francine Moens

  • Affiliations:
  • Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium;Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium

  • Venue:
  • IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this work we present and evaluate a methodology to classify web documents into a predefined hierarchy using the textual content of the documents. The general problem of hierarchical classification using taxonomies with thousands of categories is a hard task due to the problem of scarcity of training data. Hierarchical classification is one of the rare situations where, despite the large amount of available data, as more documents become available, more classes are also added to the hierarchy. This leads to a lack of training data for most of the categories, which produces poor individual classification models and tends to bias the classification to dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensions of the text-content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of scarcity of data by better grouping and identify the categories with few training examples leading to more robust classification models for those categories. The results of classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited for category grouping of web pages and representation of categories with few training examples.