Context-aware wrapping: synchronized data extraction

  • Authors:
  • Shui-Lung Chuang;Kevin Chen-Chuan Chang;ChengXiang Zhai

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign

  • Venue:
  • VLDB '07 Proceedings of the 33rd international conference on Very large data bases
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of "context awareness": Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17--83% to 78--94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09--60% to 01--11%).