WMS-extracting multiple sections data records from search engine results pages

  • Authors: Jer Lang Hong; Eu-Gene Siew; Simon Egerton

  • Affiliations: Monash University; Monash University; Monash University

  • Venue: Proceedings of the 2010 ACM Symposium on Applied Computing
  • Year: 2010

Abstract

In this paper, we develop an automatic wrapper for the extraction of multiple sections data records from search engine results pages. In the field of information extraction, little attention has been paid to wrappers that extract multiple sections data records: to date there is only one automatic wrapper, MSE, developed for this purpose. MSE uses the separation distance between data records and between sections to distinguish sections from data records and extract them from search engine results pages. Our approach instead uses DOM tree properties to develop an adaptive search method that is able to detect, differentiate, and partition sections and data records. The labeled multiple sections data records are then passed through several filtering stages. Each filter is designed to remove a particular group of irrelevant data, and filtering continues until one data region containing the relevant records is found. Our filtering rules are based on visual cues, such as text and image size, obtained from the browser rendering engine. Experimental results show that our wrapper obtains better results than the currently available MSE wrapper.
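
The staged filtering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Region` fields, the three filters, their order, and the thresholds are all hypothetical, chosen only to show how successive filters based on visual cues could narrow a set of candidate regions down to one. In practice the text and image measurements would come from a browser rendering engine; here they are plain numbers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """A candidate data region (hypothetical stand-in for a DOM subtree)."""
    name: str
    record_count: int  # repeated child records found in the subtree
    text_chars: int    # rendered text length, a proxy for "text size"
    image_area: int    # total rendered image area in square pixels


Filter = Callable[[List[Region]], List[Region]]


def drop_sparse(regions: List[Region]) -> List[Region]:
    # Assumed rule: a region with fewer than two records is not a result list.
    return [r for r in regions if r.record_count >= 2]


def drop_image_heavy(regions: List[Region]) -> List[Region]:
    # Assumed rule: regions dominated by images (e.g. banners) are irrelevant.
    # The threshold of 50 px^2 per character is arbitrary.
    return [r for r in regions
            if r.text_chars > 0 and r.image_area / r.text_chars < 50]


def keep_most_text(regions: List[Region]) -> List[Region]:
    # Assumed tie-breaker: keep the region with the most rendered text.
    return [max(regions, key=lambda r: r.text_chars)] if regions else []


def select_region(regions: List[Region], filters: List[Filter]) -> List[Region]:
    """Apply filters in order until a single candidate region remains."""
    for apply_filter in filters:
        if len(regions) <= 1:
            break
        survivors = apply_filter(regions)
        if survivors:  # never let one filter discard every candidate
            regions = survivors
    return regions


candidates = [
    Region("navigation", record_count=1, text_chars=200, image_area=0),
    Region("banner_ads", record_count=3, text_chars=100, image_area=90_000),
    Region("results", record_count=10, text_chars=5_000, image_area=10_000),
    Region("sidebar", record_count=4, text_chars=800, image_area=0),
]
chosen = select_region(candidates, [drop_sparse, drop_image_heavy, keep_most_text])
```

Each filter removes one kind of irrelevant region, and the loop stops as soon as a single region is left, which mirrors the "filter until one data region remains" idea at a very coarse level.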