Publish-time data integration for open data platforms

Authors:
Julian Eberius; Patrick Damme; Katrin Braunschweig; Maik Thiele; Wolfgang Lehner
Affiliations:
Technische Universität Dresden, Dresden, Germany (all authors)
Venue:
Proceedings of the 2nd International Workshop on Open Data
Year:
2013

Abstract

Platforms for the publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, an ontology, or even common publication standards. As a result, attributes with the same meaning carry inconsistent names, which hinders both the discovery of relationships between datasets and their reuse. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, in which the publisher receives suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We present data-driven algorithms that suggest alternative attribute names for a newly published dataset based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data Platform opendata.socrata.com and on relational data extracted from Wikipedia. We report on the system's response time and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute name alternatives.
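The abstract does not spell out the algorithms themselves, so the following is only a minimal sketch of the general idea of suggesting attribute names from instance statistics kept on a corpus. Everything in it is an assumption made for illustration: the `CorpusStats` class, its method names, the use of Jaccard overlap between value sets as the similarity measure, and the `k` and `min_score` parameters are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class CorpusStats:
    """Hypothetical per-attribute instance statistics for a dataset corpus."""

    def __init__(self):
        # attribute name -> set of distinct, normalized values seen under it
        self.values_by_attr = defaultdict(set)

    def add_dataset(self, columns):
        """Register a published dataset given as {attribute name: list of values}."""
        for name, values in columns.items():
            self.values_by_attr[name.strip().lower()].update(
                str(v).strip().lower() for v in values
            )

    def suggest_names(self, name, values, k=3, min_score=0.2):
        """Rank corpus attribute names by Jaccard overlap with the new column.

        Jaccard overlap is an assumed stand-in similarity measure,
        not the measure used in the paper.
        """
        new_values = {str(v).strip().lower() for v in values}
        current = name.strip().lower()
        scored = []
        for attr, seen in self.values_by_attr.items():
            if attr == current:
                continue  # do not suggest the name the publisher already chose
            union = new_values | seen
            if not union:
                continue
            score = len(new_values & seen) / len(union)
            if score >= min_score:
                scored.append((score, attr))
        # highest score first; ties broken alphabetically for determinism
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [attr for _, attr in scored[:k]]


corpus = CorpusStats()
corpus.add_dataset({"country": ["Germany", "France", "Italy"], "population": [83, 67, 59]})
corpus.add_dataset({"nation": ["Germany", "Spain"], "gdp": [4.0, 1.4]})

# A publisher uploads a column named "land"; the corpus suggests alternatives.
suggestions = corpus.suggest_names("land", ["Germany", "France", "Spain"])
print(suggestions)
```

In this toy corpus the new column shares more of its values with `nation` than with `country`, so `nation` is ranked first. A realistic system would presumably combine such instance-level evidence with attribute-level statistics and would need an index that keeps response times low on a large corpus, which this sketch does not attempt.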