Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

  • Authors:
  • James Caverlee; Ling Liu; David Buttler

  • Venue:
  • ICDE '04 Proceedings of the 20th International Conference on Data Engineering
  • Year:
  • 2004

Abstract

In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally-similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets.
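
The two-phase idea in the abstract can be illustrated with a small sketch. This is not THOR's algorithm: the abstract does not specify the similarity measures, so the sketch assumes a cosine similarity over tag counts for the clustering phase and a simple "text differs across pages" test for the subtree phase. All names here (`cluster_pages`, `candidate_regions`, the `threshold` parameter) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt

# Elements that never have a closing tag; they are counted but not tracked as subtrees.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PathCollector(HTMLParser):
    """Records tag counts and the text found under every element path."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_by_path = {}
        # Stack of (path, counter of child tags seen); index 0 is a root sentinel.
        self._stack = [("", Counter())]

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in VOID_TAGS:
            return
        parent_path, seen = self._stack[-1]
        seen[tag] += 1
        path = f"{parent_path}/{tag}[{seen[tag]}]"
        self.text_by_path[path] = ""
        self._stack.append((path, Counter()))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0].rsplit("/", 1)[1].startswith(tag + "["):
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Attribute the text to every open ancestor, so each path holds its subtree text.
            for path, _ in self._stack[1:]:
                self.text_by_path[path] += text + " "


def parse(html):
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    return collector.tag_counts, collector.text_by_path


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.9):
    """Phase 1 (assumed measure): greedily group pages whose tag counts are similar."""
    signatures = [parse(page)[0] for page in pages]
    clusters = []  # lists of page indices; the first index acts as the representative
    for i, signature in enumerate(signatures):
        for cluster in clusters:
            if cosine(signature, signatures[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def candidate_regions(pages):
    """Phase 2 (assumed test): most specific shared paths whose text varies across pages."""
    texts = [parse(page)[1] for page in pages]
    common = set.intersection(*(set(t) for t in texts))
    dynamic = {p for p in common if len({t[p] for t in texts}) > 1}
    return sorted(p for p in dynamic if not any(q.startswith(p + "/") for q in dynamic))
```

In this sketch, pages that share a template land in the same cluster, and within a cluster the static parts (headers, navigation) are filtered out because their text is identical on every page. `candidate_regions` expects a non-empty list of pages from a single cluster.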