ObjectRunner: lightweight, targeted extraction and querying of structured web data

  • Authors:
  • Talel Abdessalem;Bogdan Cautis;Nora Derouiche

  • Affiliations:
  • Télécom ParisTech - CNRS LTCI, Paris, France;Télécom ParisTech - CNRS LTCI, Paris, France;Télécom ParisTech - CNRS LTCI, Paris, France

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2010

Abstract

In this paper we present ObjectRunner, a system for extracting, integrating and querying structured data from the Web. The system harvests real-world items from template-based HTML pages (the so-called structured Web). It illustrates a two-phase approach to querying the Web: first, an intentional description of the targeted data is provided, in a flexible and widely applicable manner; ObjectRunner then follows a lightweight, best-effort extraction approach that leverages both this input description and the structure of the source. The process is domain-independent, in the sense that it applies to any relation, flat or nested, that describes real-world items. Through our prototype, we argue that fully automatic extraction and integration of structured data can be done quickly and effectively when the redundancy of the Web meets knowledge about the data to be extracted. We present the technical details and the overall platform through several application scenarios on real-life Web sources.
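
To make the two-phase idea in the abstract concrete, here is a minimal Python sketch. It is an illustration only and is not ObjectRunner's algorithm or interface: the schema format, the regular-expression recognizers, the `CONCERT` example, the sample `PAGE`, and the `fill` and `extract` helpers are all assumptions made for this example. Phase one describes the targeted data as a nested relation; phase two applies that description, on a best-effort basis, to the repeated records of a template-based page.

```python
import re
from html.parser import HTMLParser

# Phase one (hypothetical format): describe the targeted data as a relation.
# Atomic attributes map to a recognizer regex; nested attributes map to a dict.
CONCERT = {
    "artist": r"[A-Z][\w .&'-]+",
    "date": r"\d{4}-\d{2}-\d{2}",
    "location": {
        "venue": r"[A-Z][\w .&'-]+",
        "city": r"[A-Z][a-z]+",
    },
}

# A toy template-based page: every <li> is one record produced by the template.
PAGE = """
<ul>
  <li><b>Nina Simone</b> <i>2010-09-13</i> <span>Olympia</span> <span>Paris</span></li>
  <li><b>Site news</b> <span>Subscribe to our newsletter</span></li>
</ul>
"""


class RecordCollector(HTMLParser):
    """Collects the non-empty text fragments of each <li> record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._current is not None:
            self._current.append(text)


def fill(schema, fragments):
    """Assign fragments to attributes in schema order.

    Returns None as soon as one attribute cannot be matched, so that
    records not fitting the description are dropped (best effort).
    """
    result = {}
    for name, spec in schema.items():
        if isinstance(spec, dict):
            value = fill(spec, fragments)  # nested relation
        else:
            value = next((f for f in fragments if re.fullmatch(spec, f)), None)
            if value is not None:
                fragments.remove(value)  # each fragment is used at most once
        if value is None:
            return None
        result[name] = value
    return result


def extract(schema, html):
    """Phase two: apply the description to every record of the page."""
    collector = RecordCollector()
    collector.feed(html)
    candidates = (fill(schema, list(record)) for record in collector.records)
    return [obj for obj in candidates if obj is not None]


if __name__ == "__main__":
    for item in extract(CONCERT, PAGE):
        print(item)
```

In this sketch the second `<li>` is discarded because nothing in it matches the `date` recognizer. This loosely mirrors the idea of combining knowledge about the targeted data with the regular structure of the source, although a real system would need far more robust recognizers and template analysis than a fixed `<li>` rule.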