A Machine Learning Approach to Web Mining

  • Authors:
  • Floriana Esposito; Donato Malerba; Luigi Di Pace; Pietro Leo

  • Venue:
  • AI*IA '99 Proceedings of the 6th Congress of the Italian Association for Artificial Intelligence on Advances in Artificial Intelligence
  • Year:
  • 1999

Abstract

This paper presents WebClass, a Web mining tool for content-based classification of Web pages that can be used for resource discovery. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. Web pages are represented as bags of words, with the words selected from training documents by means of a novel scoring measure. Three classification models are compared empirically on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
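
To make the bag-of-words representation and the centroid model concrete, the following is a minimal Python sketch, not the WebClass implementation. The abstract does not describe the scoring measure used to select words, so the vocabulary here is simply passed in by hand. The use of cosine similarity, the tokenizer, and all function names (`bag_of_words`, `train_centroids`, `classify`) are assumptions made for illustration, and HTML layout information is ignored.

```python
import math
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train_centroids(labelled_pages, vocabulary):
    """Build one centroid per class from (text, label) training pairs."""
    by_class = {}
    for text, label in labelled_pages:
        by_class.setdefault(label, []).append(bag_of_words(text, vocabulary))
    return {label: centroid(vectors) for label, vectors in by_class.items()}


def classify(text, centroids, vocabulary):
    """Assign the class whose centroid is most similar to the page vector."""
    vector = bag_of_words(text, vocabulary)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))
```

A k-nearest-neighbor classifier could reuse the same vectors and similarity function, voting among the k most similar training pages instead of comparing against one centroid per class.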