Wrapper generation for automatic data extraction from large web sites

  • Authors:
  • Nitin Jindal

  • Affiliations:
  • Department of Computer Science and Engineering, Indian Institute of Technology, Delhi

  • Venue:
  • DNIS'05 Proceedings of the 4th international conference on Databases in Networked Information Systems
  • Year:
  • 2005

Abstract

This paper investigates techniques for extracting data from a large set of dynamically generated web pages. Pages generated dynamically by a single web site share a common semi-structured layout across all data objects. A wrapper for such pages is defined as the common template they share, with a different data object embedded in each page. Information extraction proceeds in three steps: (a) extraction of the data-rich section from each web page, (b) automated generation of the wrapper, and (c) data extraction from each web page by comparing it with the wrapper. Wrapper generation is the most important part of this process, and our focus was on developing improved techniques for it. Our technique is fully automated, and we achieved a good increase in both accuracy and speed.
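
To make the template idea concrete, steps (b) and (c) can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes a much simpler approach: two pages are tokenized into tags and text runs, aligned with Python's standard `difflib.SequenceMatcher`, and every token run that differs is treated as a data slot. The names `SLOT`, `tokenize`, `generate_wrapper`, and `extract` are hypothetical, and step (a), data-rich section extraction, is not covered.

```python
import re
from difflib import SequenceMatcher

SLOT = "<#DATA>"  # hypothetical placeholder marking a data field in the wrapper


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping empty runs."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def generate_wrapper(page_a, page_b):
    """Step (b), simplified: keep tokens common to both pages as the
    template and replace every differing run with a single SLOT."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    wrapper = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        else:
            wrapper.append(SLOT)
    return wrapper


def extract(wrapper, page):
    """Step (c), simplified: align a page with the wrapper and return
    the page tokens that do not match the template."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, wrapper, tokens, autojunk=False)
    data = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            data.append(" ".join(tokens[j1:j2]))
    return data


if __name__ == "__main__":
    # Made-up pages for illustration only.
    page_a = "<html><body><h1>Book A</h1><p>Price: </p><b>10</b></body></html>"
    page_b = "<html><body><h1>Book B</h1><p>Price: </p><b>25</b></body></html>"
    page_c = "<html><body><h1>Book C</h1><p>Price: </p><b>7</b></body></html>"
    wrapper = generate_wrapper(page_a, page_b)
    print(wrapper)
    print(extract(wrapper, page_c))
```

A wrapper inferred from only two pages is fragile: fields that happen to hold the same value in both pages are absorbed into the template, and optional or repeated sections are not modeled. A realistic generator would have to compare many pages and handle those cases.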