Automatic wrapper induction from hidden-web sources with domain knowledge

  • Authors: Pierre Senellart; Avin Mittal; Daniel Muschick; Rémi Gilleron; Marc Tommasi

  • Affiliations: INRIA Saclay & TELECOM ParisTech, Paris, France; Indian Institute of Technology, Bombay, India; Technische Universität Graz, Graz, Austria; Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France; Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France

  • Venue: Proceedings of the 10th ACM workshop on Web information and data management
  • Year: 2008

Abstract

We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach needs only domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from hidden-Web sources involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former, we use a combination of heuristics and probing with domain instances; for the latter, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation obtained from the domain knowledge. Experiments demonstrate the validity and potential of the approach.
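
The two ideas named in the abstract, probing a form field with domain instances and annotating result text with domain knowledge, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `DOMAIN` dictionary, the `submit` callable, the `error_marker` string, and both function names are hypothetical, and simple substring matching stands in for whatever matching the paper actually uses.

```python
# Hypothetical domain knowledge: concept names mapped to known instances.
DOMAIN = {
    "Author": {"jeffrey ullman", "serge abiteboul"},
    "Year": {"1995", "2008"},
}


def guess_field_concept(submit, domain, error_marker="no results"):
    """Probe one form field with instances of each concept.

    `submit` is an assumed callable that fills the field with a value and
    returns the text of the response page. The concept whose instances most
    often yield a page without the error marker is returned with its rate.
    """
    best_concept, best_rate = None, 0.0
    for concept in sorted(domain):
        instances = sorted(domain[concept])
        hits = sum(error_marker not in submit(value).lower() for value in instances)
        rate = hits / len(instances)
        if rate > best_rate:
            best_concept, best_rate = concept, rate
    return best_concept, best_rate


def annotate(text_nodes, domain):
    """Label each text node with the concepts whose instances occur in it.

    The labelling is deliberately naive, so it is imperfect and imprecise;
    such noisy labels could then serve as training data for a learner.
    """
    annotations = []
    for node in text_nodes:
        lowered = node.lower()
        labels = sorted(
            concept
            for concept, instances in domain.items()
            if any(instance in lowered for instance in instances)
        )
        annotations.append((node, labels))
    return annotations
```

In this sketch, a field is mapped to the concept whose instances most reliably produce non-error pages, and the annotated text nodes would be handed to a separate tree-aware learner, which is not shown here.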