Organizing structured web sources by query schemas: a clustering approach

  • Authors:
  • Bin He;Tao Tao;Kevin Chen-Chuan Chang

  • Affiliations:
  • University of Illinois at Urbana-Champaign, IL;University of Illinois at Urbana-Champaign, IL;University of Illinois at Urbana-Champaign, IL

  • Venue:
  • Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

In the recent years, the Web has been rapidly "deepened" with the prevalence of databases online. On this deep Web, many sources are structured by providing structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas ie, attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains eg, Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.