Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification

  • Authors:
  • Milos Kovacevic;Michelangelo Diligenti;Marco Gori;Veljko Milutinovic

  • Venue:
  • ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year:
  • 2002

Abstract

Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using visual information, one is able to define heuristics for the recognition of common page areas such as the header, left and right menus, footer, and center of a page. We show in initial experiments that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
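
The abstract does not give the heuristics themselves, so the following is only a minimal sketch of what a coordinate-based rule for common page areas might look like. The `PageObject` type, the `classify_area` function, and the `band` and `side` thresholds are all hypothetical and chosen for illustration; they are not the authors' rules or values.

```python
from dataclasses import dataclass


@dataclass
class PageObject:
    """A rendered HTML object with browser-screen coordinates in pixels.

    Hypothetical structure for illustration; not the paper's representation.
    """
    tag: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(obj: PageObject, page_width: float, page_height: float,
                  band: float = 0.2, side: float = 0.25) -> str:
    """Assign an object to a common page area from the position of its center.

    `band` and `side` are assumed fractions of the page height and width;
    they are illustrative thresholds, not values taken from the paper.
    """
    cx = obj.x + obj.width / 2
    cy = obj.y + obj.height / 2
    # Vertical bands are checked first, so a full-width bar at the top
    # counts as header rather than as a left or right menu.
    if cy < band * page_height:
        return "header"
    if cy > (1 - band) * page_height:
        return "footer"
    if cx < side * page_width:
        return "left menu"
    if cx > (1 - side) * page_width:
        return "right menu"
    return "center"


if __name__ == "__main__":
    nav = PageObject("table", x=0, y=500, width=150, height=800)
    print(classify_area(nav, page_width=1000, page_height=2000))
```

A label produced this way could then be attached to each word of a page as an extra feature alongside its content, which is one plausible reading of how a classifier might "take into account" the visual representation.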