Open-Set classification for automated genre identification

  • Authors:
  • Dimitrios A. Pritsos;Efstathios Stamatatos

  • Affiliations:
  • University of the Aegean, Karlovassi, Samos, Greece;University of the Aegean, Karlovassi, Samos, Greece

  • Venue:
  • ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Automated Genre Identification (AGI) of web pages is a problem of increasing importance since web genre (e.g. blog, news, e-shops, etc.) information can enhance modern Information Retrieval (IR) systems. The state-of-the-art in this field considers AGI as a closed-set classification problem where a variety of web page representation and machine learning models have intensively studied. In this paper, we study AGI as an open-set classification problem which better formulates the real world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined, one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.