Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation

  • Authors: Jinbeom Kang; Joongmin Choi

  • Venue: ISITC '07 Proceedings of the 2007 International Symposium on Information Technology Convergence
  • Year: 2007

Abstract

As Web page structures grow more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (Recognizing Informative Page Blocks) algorithm, which detects the informative blocks in a Web page by exploiting a visual block segmentation scheme. RIPB uses a visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, then groups related blocks with similar structures into block clusters and recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
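
The pipeline described in the abstract (segment, cluster by structure, filter by heuristics) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the segmentation output, the similarity measure, or the heuristic rules, so everything below is an assumption made for illustration. The `Block` fields, the exact-match `signature` clustering, and the `min_blocks` and `max_link_ratio` thresholds are all hypothetical, and the blocks are assumed to come from some visual segmenter that is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A visual block as a segmenter might emit it (hypothetical shape)."""
    signature: str  # structural fingerprint, e.g. the block's tag sequence
    text_len: int   # characters of visible text in the block
    link_len: int   # how many of those characters sit inside links


def cluster_blocks(blocks):
    """Group blocks whose structural signatures are identical.

    Assumption: exact signature equality stands in for the paper's
    unspecified notion of "similar structures".
    """
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.signature].append(block)
    return dict(clusters)


def is_informative(cluster, min_blocks=2, max_link_ratio=0.5):
    """Illustrative heuristic, not the paper's rules.

    A cluster counts as informative when its structure repeats at least
    min_blocks times and at most max_link_ratio of its text is link text
    (menus and ad lists tend to be almost entirely links).
    """
    text = sum(b.text_len for b in cluster)
    links = sum(b.link_len for b in cluster)
    if text == 0:
        return False
    return len(cluster) >= min_blocks and links / text <= max_link_ratio


def informative_clusters(blocks):
    """Return only the clusters that pass the heuristic filter."""
    return {
        sig: cluster
        for sig, cluster in cluster_blocks(blocks).items()
        if is_informative(cluster)
    }
```

Under these assumptions, a wrapper induction step would then be run only over the blocks returned by `informative_clusters`, leaving link-heavy or one-off blocks out of the rule-building input.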