FastWrap: an efficient wrapper for tabular data extraction from the web

  • Authors:
  • Mohammad Shafkat Amin;Hasan Jamil

  • Affiliations:
  • Department of Computer Science, Wayne State University;Department of Computer Science, Wayne State University

  • Venue:
  • IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration
  • Year:
  • 2009

Abstract

In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled much like instances of a traditional database, which in turn facilitates web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system automatically and efficiently discovers table structure by mining relevant patterns from web pages, and generates a regular expression for the extraction process. The approach requires no human intervention, and experimental results show promising accuracy. Moreover, the algorithm generates the wrapper in linear time.
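
The abstract does not give the details of the algorithm, so the following Python sketch is only an illustration of the general idea it describes: mine the most frequent repeating tag pattern from a page in a single pass, then turn that pattern into a regular expression that extracts the rows. It is not the FastWrap method itself. The function names (`tokenize`, `mine_row_pattern`, `pattern_to_regex`, `extract_rows`), the `#TEXT` placeholder, and the assumption that rows are delimited by plain `<tr>` and `</tr>` tags without attributes are all hypothetical choices made for this example.

```python
import re
from collections import Counter

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text between tags
TEXT = "#TEXT"                        # placeholder standing in for any data value


def tokenize(html):
    """Split HTML into tag and text tokens, skipping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def mine_row_pattern(html, open_tag="<tr>", close_tag="</tr>"):
    """Return the most frequent tag signature between row delimiters.

    One pass over the tokens, so the cost is linear in the page size.
    Assumes rows start with exactly `open_tag` (no attributes).
    """
    counts = Counter()
    current = None
    for tok in tokenize(html):
        if tok == open_tag:
            current = [tok]
        elif current is not None:
            # Keep tags verbatim, abstract every text value to the placeholder.
            current.append(tok if tok.startswith("<") else TEXT)
            if tok == close_tag:
                counts[tuple(current)] += 1
                current = None
    return counts.most_common(1)[0][0] if counts else None


def pattern_to_regex(signature):
    """Turn a mined signature into a regex with one capture group per value."""
    parts = [r"([^<]+?)" if tok == TEXT else re.escape(tok) for tok in signature]
    return re.compile(r"\s*".join(parts))


def extract_rows(html):
    """Mine the dominant row pattern and apply it as a wrapper to the page."""
    signature = mine_row_pattern(html)
    if signature is None:
        return []
    wrapper = pattern_to_regex(signature)
    return [tuple(v.strip() for v in m.groups()) for m in wrapper.finditer(html)]


page = """
<html><body><h1>Stock</h1>
<table>
<tr><th>Item</th><th>Price</th></tr>
<tr><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.90</td></tr>
<tr><td> Plum </td><td>2.05</td></tr>
</table></body></html>
"""

for row in extract_rows(page):
    print(row)
```

In this example the header row has a different tag signature from the three data rows and occurs only once, so the mined pattern matches the data rows alone. A real wrapper generator would also need to handle tag attributes, nested tables, and optional cells, none of which this sketch attempts.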