Web page genre classification

  • Authors:
  • Guangyu Chen;Ben Choi

  • Affiliations:
  • Louisiana Tech University, LA;Louisiana Tech University, LA

  • Venue:
  • Proceedings of the 2008 ACM symposium on Applied computing
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper we present an automatic genre-based Web page classification system. Unlike subject or topic based classifications, genre-based classifications focus on functional purposes and classify web pages into categories such as online shopping, technical paper, or discussion forum. Until now, the genre classifications are not well developed due to the subjectivities and difficulties to define the genre, the features, and even the categories. In this paper, we define five top-level genre categories, each of which has several subcategories, and develop new methods to extract 31 features from Web pages to identify the categories. We analyze not only the contents of the Web pages, but also the URLs, HTML tags, Java scripts, and VB scripts. We developed a genre classification system that achieved average accuracy of 93%. In addition, we combined this genre classification with our subject-based classification to produce a comprehensive Web page classification system.