Clustering structured web sources: a schema-based, model-differentiation approach

  • Authors:
  • Bin He; Tao Tao; Kevin Chen-Chuan Chang

  • Affiliations:
  • Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL; Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL; Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL

  • Venue:
  • EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology
  • Year:
  • 2004

Abstract

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources with similar query capabilities. In abstraction, this problem is essentially clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering the Web query schemas, the model-differentiation function outperforms existing ones with the hierarchical agglomerative clustering algorithm.
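
The abstract does not spell out the test statistic or the exact form of the objective, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes each query schema is a non-empty set of attribute names (a "transaction"), uses a Pearson chi-square statistic over attribute counts as a stand-in homogeneity test, and greedily merges the two clusters whose attribute distributions look most alike. The function names `attribute_counts`, `chi_square`, and `cluster_schemas`, as well as the sample schemas, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def attribute_counts(cluster):
    """Count how often each attribute appears across a cluster's schemas."""
    counts = Counter()
    for schema in cluster:
        counts.update(schema)
    return counts


def chi_square(cluster_a, cluster_b):
    """Pearson chi-square statistic for a 2 x V table of attribute counts.

    Assumption: a small value suggests both clusters could come from the
    same generative model. Schemas are assumed non-empty.
    """
    a, b = attribute_counts(cluster_a), attribute_counts(cluster_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        column = a[attr] + b[attr]
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, k):
    """Agglomerative clustering: merge the least-different pair until k remain."""
    clusters = [[schema] for schema in schemas]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: chi_square(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    # Hypothetical query schemas from two domains.
    books = [{"title", "author", "isbn"}, {"title", "author", "publisher"}]
    flights = [{"from", "to", "date"}, {"from", "to", "passengers"}]
    for cluster in cluster_schemas(books + flights, 2):
        print(cluster)
```

Merging the pair with the smallest statistic keeps the remaining clusters as statistically distinct as possible at each step, which is one simple way to approximate "maximizing heterogeneity among clusters"; a faithful implementation would follow the test and objective defined in the paper itself.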