Fully automatic wrapper generation for search engines

  • Authors: Hongkun Zhao; Weiyi Meng; Zonghuan Wu; Vijay Raghavan; Clement Yu

  • Affiliations: SUNY at Binghamton, Binghamton, NY; SUNY at Binghamton, Binghamton, NY; Univ. of Louisiana at Lafayette, Lafayette, LA; Univ. of Louisiana at Lafayette, Lafayette, LA; University of Illinois at Chicago, Chicago, IL

  • Venue: WWW '05: Proceedings of the 14th International Conference on World Wide Web
  • Year: 2005

Abstract

When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to a retrieved Web page, a snippet of it, or both. Such a result page often also contains information irrelevant to the query, such as advertisements and information related to the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from the dynamically generated result pages returned by search engines. Automatic search result record extraction is important for many applications that need to interact with search engines, such as deep Web crawling and the automatic construction and maintenance of metasearch engines. The novel aspect of the proposed technique is that it uses both the visual content features of the result page as displayed in a browser and the tag structure of the page's HTML source. Experimental results indicate that this technique can achieve very high extraction accuracy.
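
To make the idea of extracting repeated result records concrete, the following is a minimal sketch and not the authors' algorithm. It relies on tag structure alone: it looks for the parent element whose children most often share the same tag signature and contain a link, and treats those children as result records. The visual content features described in the abstract require a rendered page and are left out here. All names (`Node`, `TreeBuilder`, `find_records`, `SAMPLE_PAGE`, the `min_repeats` threshold) and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a very small element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def all_text(node):
    parts = list(node.text)
    for child in node.children:
        parts.extend(all_text(child))
    return parts


def signature(node):
    # Assumption: siblings with the same tag and child-tag sequence are "similar".
    return (node.tag, tuple(child.tag for child in node.children))


def to_record(node):
    link = next(d for d in walk(node) if d.tag == "a")
    return {
        "url": link.attrs.get("href"),
        "title": " ".join(all_text(link)),
        "text": " ".join(all_text(node)),
    }


def find_records(html, min_repeats=3):
    builder = TreeBuilder()
    builder.feed(html)
    best = []
    for parent in walk(builder.root):
        candidates = [c for c in parent.children if c.children]
        counts = Counter(signature(c) for c in candidates)
        if not counts:
            continue
        top_sig, _ = counts.most_common(1)[0]
        group = [
            c for c in candidates
            if signature(c) == top_sig and any(d.tag == "a" for d in walk(c))
        ]
        # Keep the largest group of similar, link-bearing siblings.
        if len(group) >= min_repeats and len(group) > len(best):
            best = group
    return [to_record(node) for node in best]


SAMPLE_PAGE = """
<html><body>
<div id="header"><a href="/">Home</a><span>MySearch</span></div>
<ul>
<li><a href="http://a.example/">Alpha</a><p>First snippet</p></li>
<li><a href="http://b.example/">Beta</a><p>Second snippet</p></li>
<li><a href="http://c.example/">Gamma</a><p>Third snippet</p></li>
</ul>
<div class="ad"><a href="http://ads.example/">Buy now</a></div>
</body></html>
"""

records = find_records(SAMPLE_PAGE)
```

On the sample page, the header and advertisement blocks are skipped because neither repeats at least `min_repeats` times, while the three `li` elements share one signature and each contains a link. A purely structural heuristic like this is easy to fool on real pages, which is presumably why the paper pairs tag structure with visual features.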