Merging Interface Schemas on the Deep Web via Clustering Aggregation

Authors:
Wensheng Wu;AnHai Doan;Clement Yu
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Chicago
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005

Abstract

We consider the problem of integrating a large number of interface schemas over the Deep Web. The scale of the problem and the diversity of the sources present serious challenges to conventional manual or rule-based approaches to schema integration. To address these challenges, we propose a novel formulation of schema integration as an optimization problem, with the objective of maximally satisfying the constraints given by the individual schemas. Since the optimization problem can be shown to be NP-complete, we develop a novel approximation algorithm, LMax, which builds the unified schema via recursive applications of clustering aggregation. We further extend LMax to handle the irregularities that frequently occur among interface schemas. Extensive evaluation on real-world data sets shows the effectiveness of our approach.
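
To make the notion of clustering aggregation concrete, the sketch below shows one simple baseline for it: count the element pairs on which two clusterings disagree, then pick the input clustering with the fewest total disagreements against all inputs. This is not LMax, and the abstract does not describe LMax at this level of detail. The function names, the attribute names, and the assumption that every schema groups the same set of attributes are all hypothetical choices made for illustration.

```python
from itertools import combinations


def disagreements(a, b):
    """Count element pairs that clusterings a and b treat differently.

    Each clustering maps an element to a cluster label. A pair disagrees
    when one clustering puts the two elements together and the other
    separates them. Assumes a and b cover the same elements.
    """
    return sum(
        (a[x] == a[y]) != (b[x] == b[y])
        for x, y in combinations(sorted(a), 2)
    )


def best_input_clustering(clusterings):
    """Return the input clustering with the fewest total disagreements."""
    return min(
        clusterings,
        key=lambda c: sum(disagreements(c, other) for other in clusterings),
    )


# Hypothetical attribute groupings taken from three interface schemas.
schemas = [
    {"author": 0, "title": 0, "price": 1, "format": 1},
    {"author": 0, "title": 0, "price": 1, "format": 2},
    {"author": 0, "title": 1, "price": 2, "format": 2},
]
consensus = best_input_clustering(schemas)
print(consensus)
```

Restricting the answer to one of the inputs keeps the sketch short; a practical aggregation method would also consider clusterings that none of the sources proposes, and would need to cope with schemas that do not share the same attributes.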