Extraction of Informative Blocks from Web Pages

  • Authors: YuJuan Cao; ZhenDong Niu; LiuLing Dai; YuMing Zhao

  • Venue: ALPIT '08 Proceedings of the 2008 International Conference on Advanced Language Processing and Web Information Technology
  • Year: 2008

Abstract

Web pages typically contain many banner ads, navigation bars, copyright notices, and similar elements. This irrelevant information is not part of a page's main content, and it seriously harms Web mining and search. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a Web page into semantic blocks using vision-based page segmentation, and we combine each block's visual features with the semantic information obtained by LSI (Latent Semantic Indexing) to form the block's feature vector. Second, we manually label the blocks as informative or uninformative and use the labeled blocks as a training set for a classification model. The trained model then extracts the informative blocks. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves results in near-duplicate detection and classification tasks.
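
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' EIBA implementation. It assumes the page has already been segmented into blocks, and the two visual features (`area`, `link_density`), the toy training blocks, and all function names are hypothetical. The abstract does not name the classifier, so a nearest-centroid rule is swapped in, and LSI is approximated by power iteration in place of a full SVD.

```python
import math
from collections import Counter

def unit_term_row(text, vocab):
    """Bag-of-words row over vocab, L2-normalised (all zeros if no known term)."""
    counts = Counter(text.lower().split())
    row = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row

def lsi_basis(A, k, iters=200):
    """Approximate the top-k right singular vectors of A by power iteration
    with deflation on A^T A (a pure-Python stand-in for a truncated SVD)."""
    n = len(A[0])
    basis = []
    for i in range(k):
        v = [1.0 + ((j * (i + 1)) % 5) for j in range(n)]  # fixed start vector
        for _ in range(iters):
            for b in basis:  # deflate: drop components along earlier vectors
                d = sum(x * y for x, y in zip(v, b))
                v = [x - d * y for x, y in zip(v, b)]
            Av = [sum(a * x for a, x in zip(row, v)) for row in A]
            v = [sum(row[j] * s for row, s in zip(A, Av)) for j in range(n)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
        basis.append(v)
    return basis

def features(block, vocab, basis):
    """Feature vector = assumed visual features + LSI coordinates of the text."""
    row = unit_term_row(block["text"], vocab)
    lsi = [sum(a * b for a, b in zip(row, v)) for v in basis]
    return [block["area"], block["link_density"]] + lsi

def train(blocks, k=2):
    """Fit the LSI basis on the labeled blocks and store one centroid per label."""
    vocab = sorted({w for b in blocks for w in b["text"].lower().split()})
    A = [unit_term_row(b["text"], vocab) for b in blocks]
    basis = lsi_basis(A, k)
    by_label = {}
    for b in blocks:
        by_label.setdefault(b["label"], []).append(features(b, vocab, basis))
    centroids = {lab: [sum(col) / len(vs) for col in zip(*vs)]
                 for lab, vs in by_label.items()}
    return vocab, basis, centroids

def classify(block, model):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    vocab, basis, centroids = model
    f = features(block, vocab, basis)
    return min(centroids,
               key=lambda lab: sum((x - y) ** 2 for x, y in zip(f, centroids[lab])))

# Hand-labeled toy blocks (hypothetical data, standing in for annotated segments).
TRAIN = [
    {"text": "study reports results on page segmentation",
     "area": 0.55, "link_density": 0.05, "label": "informative"},
    {"text": "authors evaluate page segmentation results",
     "area": 0.60, "link_density": 0.08, "label": "informative"},
    {"text": "home sports contact login register",
     "area": 0.08, "link_density": 0.90, "label": "uninformative"},
    {"text": "copyright contact privacy terms",
     "area": 0.05, "link_density": 0.70, "label": "uninformative"},
    {"text": "advertisement sale login register",
     "area": 0.10, "link_density": 0.85, "label": "uninformative"},
]
MODEL = train(TRAIN)

new_block = {"text": "new results on page segmentation",
             "area": 0.50, "link_density": 0.10}
print(classify(new_block, MODEL))
```

Words absent from the training vocabulary (here "new") are simply ignored when a block is projected. In practice the feature scales would need balancing so that neither the visual nor the LSI part dominates the distance.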