Multiple sets of features for automatic genre classification of web documents

  • Authors:
  • Chul Su Lim; Kong Joo Lee; Gil Chang Kim

  • Affiliations:
  • Division of Computer Science, Department of EECS, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea; School of Computer and Information Technology, KyungIn Women's College, 101 Kyesan-dong, Gyeyang-gu, Incheon 407-740, South Korea; Division of Computer Science, Department of EECS, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2005

Abstract

As the amount of information on the Web grows, it becomes difficult to quickly find the desired information among the documents retrieved by a search engine. One way to address this problem is to classify web documents according to various criteria. Most document classification work has focused on the subject or topic of a document. Genre, or style, offers a view of a document that is distinct from its subject or topic, and it can also serve as a classification criterion. In this paper, we propose multiple sets of features for classifying the genres of web documents. The basic set of features, proposed in previous studies, is derived from the textual properties of documents, such as the number of sentences and the number of occurrences of a certain word. However, web documents differ from plain textual documents in that they are associated with URLs and contain HTML tags within their pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. We evaluate the performance of the proposed feature sets and discuss their characteristics. Finally, we identify which set of features is appropriate for automatic genre classification of web documents.
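
To make the three kinds of feature sets concrete, the following is a minimal Python sketch of how textual, URL-based, and HTML-tag-based features might be collected for a single page. It is an illustration only: the abstract does not list the actual features used in the paper, so the function name `extract_features`, the feature names, and the specific choices (counting the word "we", URL path depth, a tilde in the path, counts of `a`, `img`, and `form` tags) are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCounter(HTMLParser):
    """Counts opening tags and collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(url, html):
    """Return a dict of illustrative (hypothetical) genre features."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)

    # Textual features: crude sentence and word counts.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())

    # URL features: path depth and a tilde, which often marks a personal page.
    parsed = urlparse(url)
    path_segments = [p for p in parsed.path.split("/") if p]

    return {
        "text_num_sentences": len(sentences),
        "text_num_words": len(words),
        "text_count_we": words.count("we"),  # "a certain word" (assumed)
        "url_depth": len(path_segments),
        "url_has_tilde": "~" in parsed.path,
        # HTML tag features: counts of a few tags (assumed choices).
        "html_num_links": parser.tags["a"],
        "html_num_images": parser.tags["img"],
        "html_num_forms": parser.tags["form"],
    }
```

A feature dictionary like this could be turned into a numeric vector and passed to any standard classifier. Keeping each group under its own prefix (`text_`, `url_`, `html_`) makes it straightforward to train on one feature set at a time and compare the results, which is the kind of evaluation the abstract describes.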