A site oriented method for segmenting web pages
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
As the structure of Web pages grows more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (Recognizing Informative Page Blocks) algorithm, which detects the informative blocks in a Web page by exploiting a visual block segmentation scheme. RIPB uses the visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
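The segment, cluster, and filter pipeline described above can be illustrated with a minimal sketch. It is not the RIPB implementation: the abstract does not specify the similarity measure or the heuristic rules, so the `Block` fields, the exact-match structural signature, and both thresholds below are assumptions made for illustration only. The sketch also assumes the visual segmentation step has already produced the blocks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """One logical block, as a visual segmenter might emit it (hypothetical shape)."""
    signature: str  # structural fingerprint, e.g. the block's tag path (assumed)
    text_len: int   # characters of visible text in the block
    link_len: int   # characters of that text that sit inside links


def cluster_blocks(blocks):
    """Group blocks that share the same structural signature."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.signature].append(block)
    return dict(clusters)


def informative_clusters(clusters, min_text=200, max_link_ratio=0.5):
    """Keep clusters with enough total text and a low share of link text.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for signature, members in clusters.items():
        text = sum(b.text_len for b in members)
        links = sum(b.link_len for b in members)
        # Multiplying instead of dividing avoids a zero-division on empty clusters.
        if text >= min_text and links <= max_link_ratio * text:
            kept.append(signature)
    return kept


blocks = [
    Block("div/ul/li/a", 40, 40),       # navigation menu entries
    Block("div/ul/li/a", 35, 35),
    Block("div/table/tr/td", 300, 20),  # repeated data records
    Block("div/table/tr/td", 280, 10),
    Block("div/p", 60, 0),              # copyright notice
]
result = informative_clusters(cluster_blocks(blocks))
print(result)
```

In this toy input only the repeated record cluster passes both rules; the menu cluster and the copyright block fall below the text threshold. A wrapper induction module would then build its extraction rules from the surviving clusters alone.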