Automatic genre detection of web documents

  • Authors:
  • Chul Su Lim; Kong Joo Lee; Gil Chang Kim

  • Affiliations:
  • Division of Computer Science, Department of EECS, KAIST, Taejon; School of Computer & Information Technology, KyungIn Women’s College, Incheon; Division of Computer Science, Department of EECS, KAIST, Taejon

  • Venue:
  • IJCNLP'04: Proceedings of the First International Joint Conference on Natural Language Processing
  • Year:
  • 2004

Abstract

Genre, or style, offers a view of a document that is distinct from its subject or topic, and it can likewise serve as a criterion for classifying documents. Several studies have addressed genre detection for textual documents, but only a few have dealt with web documents. In this paper we propose sets of features for detecting the genres of web documents. Web documents differ from plain textual documents in that they contain URLs and HTML tags within their pages. We introduce features specific to web documents, extracted from URLs and HTML tags. Experimental results allow us to evaluate the characteristics and performance of these features.
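
To make the idea of URL- and HTML-based features concrete, the following is a minimal sketch in Python using only the standard library. The abstract does not list the paper's actual features, so every feature here (URL path depth, file extension, top-level domain, relative tag frequencies) and every name (`TagCounter`, `url_features`, `html_features`) is a hypothetical illustration, not the authors' feature set.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.tag_counts[tag] += 1


def url_features(url):
    """Illustrative URL features (assumed, not taken from the paper)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1] if segments else ""
    host = parsed.hostname or ""
    return {
        "url_depth": len(segments),                  # number of path segments
        "url_has_query": int(bool(parsed.query)),    # 1 if a query string exists
        "url_file_ext": last.rsplit(".", 1)[1].lower() if "." in last else "",
        "url_tld": host.rsplit(".", 1)[-1] if host else "",
    }


def html_features(html):
    """Relative frequency of each HTML tag (assumed feature, for illustration)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.tag_counts.values())
    if total == 0:
        return {}
    return {f"tag_{tag}": count / total for tag, count in parser.tag_counts.items()}


if __name__ == "__main__":
    page = "<html><body><a href='x'>one</a><a href='y'>two</a></body></html>"
    features = {}
    features.update(url_features("http://www.example.com/papers/2004/genre.html?x=1"))
    features.update(html_features(page))
    print(features)
```

A feature dictionary of this kind could be passed to any standard classifier; which features actually help distinguish genres is the question the paper's experiments address.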