Towards Automatic Web Genre Identification

  • Authors:
  • G. Rehm

  • Affiliations:
  • -

  • Venue:
  • HICSS '02: Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02), Volume 4
  • Year:
  • 2002

Abstract

We analyse academic Web pages in order to classify them automatically into Web genres. For this purpose, we have developed a database-driven corpus, currently containing over 1,300,000 documents, which forms the empirical basis of our research. We introduce the notion of a Web genre type, which constitutes the framework for a certain Web genre, and the notions of compulsory and optional Web genre modules. These modules act as building blocks that together make up the structure characterised by the Web genre type, and they operate as modifiers for the default assignment. The analysis of a 200-document sample illustrates our notion of a Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four documents of the Web genre Academic's Personal Homepage demonstrates our approach and our long-term goal: automatically extracting the contents of Web genre modules in order to build structured XML documents from unstructured HTML documents.
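
The relationship between a Web genre type, its compulsory and optional modules, and the structured XML output described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the class `WebGenreType`, the function `to_xml`, the module names ("name", "contact", "publications", "teaching"), and the XML element and attribute names are all assumptions made for illustration, and the extraction of module contents from HTML is assumed to have happened already.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class WebGenreType:
    """A Web genre type with compulsory and optional modules (illustrative only)."""
    name: str
    compulsory: list[str]
    optional: list[str] = field(default_factory=list)

    def missing_modules(self, found: dict[str, str]) -> list[str]:
        """Return the compulsory modules that were not found in a document."""
        return [m for m in self.compulsory if m not in found]


def to_xml(genre: WebGenreType, found: dict[str, str]) -> str:
    """Serialise already-extracted module contents as a small XML document."""
    root = ET.Element("document", genre=genre.name)
    for module in genre.compulsory + genre.optional:
        if module in found:
            status = "compulsory" if module in genre.compulsory else "optional"
            elem = ET.SubElement(root, "module", name=module, status=status)
            elem.text = found[module]
    return ET.tostring(root, encoding="unicode")


# Hypothetical genre definition and hypothetical extraction result.
homepage = WebGenreType(
    name="academic-personal-homepage",
    compulsory=["name", "contact"],
    optional=["publications", "teaching"],
)
extracted = {"name": "Jane Doe", "publications": "List of papers"}

xml_text = to_xml(homepage, extracted)
print(xml_text)
print("Missing compulsory modules:", homepage.missing_modules(extracted))
```

In this sketch a document lacking a compulsory module is still serialised; whether such a document should instead be rejected or assigned to a different genre is a design decision the abstract does not settle.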