Clustering structured web sources: a schema-based, model-differentiation approach

Authors:
Bin He;Tao Tao;Kevin Chen-Chuan Chang
Affiliations:
Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL;Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL;Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology
Year:
2004

Citing 13
Cited 7

A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Automatic discovery of language models for text databases

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Algorithms for Model-Based Gaussian Hierarchical Clustering

SIAM Journal on Scientific Computing
CACTUS—clustering categorical data using summaries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data clustering: a review

ACM Computing Surveys (CSUR)
ROCK: a robust clustering algorithm for categorical attributes

Information Systems
Probe, count, and classify: categorizing hidden web databases

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
COOLCAT: an entropy-based algorithm for categorical clustering

Proceedings of the eleventh international conference on Information and knowledge management
MedMaker: A Mediation System Based on Declarative Specifications

ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Determining Text Databases to Search in the Internet

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Querying Heterogeneous Information Sources Using Source Descriptions

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
Clustering categorical data: an approach based on dynamical systems

The VLDB Journal — The International Journal on Very Large Data Bases
Statistical schema matching across web query interfaces

Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Understanding Web query interfaces: best-effort parsing with hidden syntax

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Mining complex matchings across Web query interfaces

Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Discovering complex matchings across web query interfaces: a correlation mining approach

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Schema Matching across Query Interfaces on the Deep Web

BNCOD '08 Proceedings of the 25th British national conference on Databases: Sharing Data, Information and Knowledge
Measuring similarity of Chinese web databases based on category hierarchy

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Deep web integrated systems: current achievements and open issues

Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

Journal of Intelligent Information Systems

Abstract

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources with similar query capabilities. In abstraction, this problem is essentially clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering the Web query schemas, the model-differentiation function outperforms existing ones with the hierarchical agglomerative clustering algorithm.
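
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which hypothesis test is used, so the code assumes a Pearson chi-square homogeneity statistic over attribute counts. The names `chi_square` and `cluster`, the toy schemas, and the greedy merge rule are all hypothetical. Each query schema is treated as a transaction (a bag of attribute names), and hierarchical agglomerative clustering repeatedly merges the pair of clusters showing the least evidence of coming from different generative models.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square statistic for homogeneity of two attribute-count
    vectors (assumed test; the abstract does not specify one)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        col = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row in ((a[attr], total_a), (b[attr], total_b)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster(schemas, k):
    """Greedy agglomerative clustering of non-empty schemas into k clusters.

    Each cluster is (member indices, pooled attribute counts). At every step
    the pair with the smallest statistic is merged, i.e. the pair that looks
    most like samples from one model."""
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi_square(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(members) for members, _ in clusters)


# Toy query schemas (invented for illustration): two book sources, two car sources.
schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["make", "model", "price"],
    ["make", "model", "year"],
]
print(cluster(schemas, 2))
```

On this toy input the two book-like schemas end up in one cluster and the two car-like schemas in the other. Comparing raw statistics across pairs with different numbers of attributes is a simplification; a fuller treatment would account for degrees of freedom or use p-values.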