Genre Categorization of Web Pages

  • Authors:
  • Jebari Chaker;Ounelli Habib

  • Venue:
  • ICDMW '07 Proceedings of the Seventh IEEE International Conference on Data Mining Workshops
  • Year:
  • 2007

Abstract

As the number of web pages grows, it becomes very difficult to find the desired information easily and quickly among the thousands of web pages retrieved by a search engine. To address this problem, many studies propose classifying documents according to their genre, a criterion distinct from topic. Most of these works assign a document to only one genre. In this paper we propose a new flexible approach for document genre categorization. Flexibility means that our approach assigns a document to all predefined genres with different weights. The proposed approach is based on the combination of two homogeneous classifiers: a contextual classifier and a structural classifier. The contextual classifier uses the URL, while the structural classifier uses the document structure. Both are centroid-based classifiers. Experiments yield a micro-averaged break-even point (BEP) above 85%, which is better than the results obtained by other categorization approaches.
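
The idea of combining two centroid-based classifiers into per-genre weights can be illustrated with a minimal sketch. The abstract does not specify the features, the similarity measure, or the combination rule, so everything below is an assumption made for illustration: sparse feature dictionaries, cosine similarity to each genre centroid, a linear mix controlled by a hypothetical parameter `alpha`, and normalization so the weights sum to 1. The function names and toy data are likewise hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}

def train(examples):
    """Build one centroid per genre from (feature_vector, genre) pairs."""
    by_genre = {}
    for vec, genre in examples:
        by_genre.setdefault(genre, []).append(vec)
    return {genre: centroid(vecs) for genre, vecs in by_genre.items()}

def genre_weights(url_vec, struct_vec, url_centroids, struct_centroids, alpha=0.5):
    """Assign a weight to every genre (assumed rule: linear mix of cosine scores)."""
    genres = sorted(set(url_centroids) | set(struct_centroids))
    scores = {}
    for genre in genres:
        s_url = cosine(url_vec, url_centroids.get(genre, {}))
        s_struct = cosine(struct_vec, struct_centroids.get(genre, {}))
        scores[genre] = alpha * s_url + (1 - alpha) * s_struct
    total = sum(scores.values())
    # Normalize so the weights over all genres sum to 1 (all zeros if no match).
    return {g: (s / total if total else 0.0) for g, s in scores.items()}

# Toy data (hypothetical): URL tokens and HTML tag counts for two genres.
url_centroids = train([({"faq": 1, "help": 1}, "faq"),
                       ({"shop": 1, "cart": 1}, "eshop")])
struct_centroids = train([({"li": 5, "a": 2}, "faq"),
                          ({"img": 4, "form": 1, "a": 2}, "eshop")])

weights = genre_weights({"faq": 1}, {"li": 3, "a": 1, "img": 1},
                        url_centroids, struct_centroids)
print(weights)
```

In this sketch every predefined genre receives a weight rather than a single hard label, which mirrors the flexibility described in the abstract; a hard assignment, if needed, would simply take the genre with the largest weight.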