Automatically constructing wrappers for effective and efficient web information extraction

  • Authors:
  • Vijay V. Raghavan;Dheerendranath Mundluru

  • Affiliations:
  • University of Louisiana at Lafayette;University of Louisiana at Lafayette

  • Venue:
  • Automatically constructing wrappers for effective and efficient web information extraction
  • Year:
  • 2008

Abstract

In the past decade, Web mining researchers have shown significant interest in developing tools that enable effective and efficient communication with Web search sources, which allow one to search for information stored in back-end databases or indexes. Such tools are extremely important for systems such as metasearch engines, search engines, and vertical portals. In this research, I studied three problems that are important when interacting with Web search sources.

The first problem deals with automatically extracting records present in Web pages. These Web pages can be static HTML pages or dynamic pages returned by Web search sources in response to submitted queries. The proposed algorithm, which is the main contribution of this dissertation, takes a few sample Web pages from a given source as input and automatically constructs a wrapper. The wrapper can then be used to instantly extract records present in new Web pages returned by that source. The algorithm is based on a few important observations about the way records are generally displayed in a Web page (e.g., contiguous records are formatted using similar HTML tags). Experiments showed that the algorithm is highly effective and efficient in extracting records and that it performed considerably better than two state-of-the-art systems.

The second problem deals with automatically constructing wrappers for extracting subsequent response pages, which are accessible through hyperlinks or buttons present in the first response page returned by a Web search source. The proposed algorithm depends on the characteristics of the URLs representing subsequent response pages. A more generic machine learning-based algorithm has also been proposed to extract subsequent pages. Experiments showed that the proposed algorithms are highly effective and efficient.

The third problem deals with automatically discovering the query language features of Web search sources. The proposed algorithm submits a set of probe queries to the Web search source and analyzes the hyperlinks present in the returned response pages to identify its query language features. Unlike prior approaches, the algorithm uses hyperlinks as a feature in identifying the query language, which allowed it to achieve very high coverage. Experiments also showed that the algorithm is highly effective and efficient.
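
The observation behind the first problem, that contiguous records are formatted using similar HTML tags, can be illustrated with a small sketch. This is not the dissertation's algorithm and it does not construct a reusable wrapper. It only shows, under simplifying assumptions, how a run of adjacent sibling elements with the same tag structure can be treated as candidate records. All names here (`Node`, `TreeBuilder`, `longest_run`, `extract_records`, `PAGE`) are hypothetical, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for this sketch only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.parts = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [p for p in self.parts if isinstance(p, Node)]

    def signature(self):
        # A node's "shape": its own tag plus the tags of its direct children.
        return (self.tag, tuple(c.tag for c in self.children))

    def text(self):
        chunks = [p.text() if isinstance(p, Node) else p for p in self.parts]
        return " ".join(c for c in chunks if c)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.parts.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.parts.append(data)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def longest_run(children):
    """Longest run of adjacent siblings that share the same signature."""
    best, run = [], []
    for child in children:
        if run and child.signature() == run[-1].signature():
            run.append(child)
        else:
            run = [child]
        if len(run) > len(best):
            best = list(run)
    return best


def extract_records(html, min_records=2):
    """Return the text of the longest run of similarly tagged siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = []
    for node in iter_nodes(builder.root):
        run = longest_run(node.children)
        if len(run) > len(best):
            best = run
    return [r.text() for r in best] if len(best) >= min_records else []


# Invented sample page: three result records between a heading and a footer.
PAGE = """
<html><body>
<h1>Results</h1>
<div id="results">
  <div class="r"><a href="/1">First title</a><p>First snippet</p></div>
  <div class="r"><a href="/2">Second title</a><p>Second snippet</p></div>
  <div class="r"><a href="/3">Third title</a><p>Third snippet</p></div>
</div>
<p>Footer</p>
</body></html>
"""

for record in extract_records(PAGE):
    print(record)
```

The sketch picks the longest run of same-shaped siblings anywhere in the page, so a long navigation menu could outrank the real result list, and records whose tag structure varies slightly would be split. A practical extractor would need further cues to handle such cases.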