Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems

Authors:
Hatem A. Mahmoud;Ashraf Aboulnaga
Affiliations:
University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 13
Cited 4

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Machine Learning

Machine Learning
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
Statistical schema matching across web query interfaces

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Pattern Classification (2nd Edition)

Pattern Classification (2nd Edition)
Organizing structured web sources by query schemas: a clustering approach

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Structured databases on the web: observations and implications

ACM SIGMOD Record
Introduction to Data Mining, (First Edition)

Introduction to Data Mining, (First Edition)
Merging Interface Schemas on the Deep Web via Clustering Aggregation

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
From databases to dataspaces: a new abstraction for information management

ACM SIGMOD Record
Bootstrapping pay-as-you-go data integration systems

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Google's Deep Web crawl

Proceedings of the VLDB Endowment
Data integration with uncertainty

The VLDB Journal — The International Journal on Very Large Data Bases

A novel measure of edge centrality in social networks

Knowledge-Based Systems
Incrementally improving dataspaces based on user feedback

Information Systems
Publish-time data integration for open data platforms

Proceedings of the 2nd International Workshop on Open Data
Big data challenge: a data management perspective

Frontiers of Computer Science: Selected Publications from Chinese Universities

Quantified Score

Hi-index	0.00

Visualization

Abstract

A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given a keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.