Identifying and weighting integration hypotheses on open data platforms

Authors:
Julian Eberius;Katrin Braunschweig;Maik Thiele;Wolfgang Lehner
Affiliations:
Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany
Venue:
Proceedings of the First International Workshop on Open Data
Year:
2012

Abstract

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information, publicly available to anyone. This data has the potential to drive innovation and lead to a more democratic and transparent society. Still, the platforms it is offered on have some unique problems: their free-for-all nature, the lack of publishing standards, and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems, such as duplicated or partitioned datasets. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with data integration problems. However, these methods still require input in the form of specific questions or tasks that can be passed to the crowd. This paper identifies several classes of integration problems on Open Data Platforms and proposes a method for identifying and ranking them in this context. In this method, an Open Data Platform is modeled as a graph of datasets, so that potential integration problems, called integration hypotheses, can be identified by analyzing the graph for specific patterns. The paper concludes with a comprehensive evaluation using one of the largest Open Data platforms, opendata.socrata.com.
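
The graph-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how edges are defined or weighted, so the Jaccard similarity over column names, the 0.5 threshold, the function names, and the toy datasets below are all assumptions made for illustration only.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections of column names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_graph(datasets, threshold=0.5):
    """Build weighted edges {(name1, name2): weight} between datasets.

    Assumption: an edge exists when the schema overlap reaches the threshold.
    """
    edges = {}
    for (n1, c1), (n2, c2) in combinations(sorted(datasets.items()), 2):
        weight = jaccard(c1, c2)
        if weight >= threshold:
            edges[(n1, n2)] = weight
    return edges

def rank_hypotheses(edges):
    """Treat each edge as a candidate integration hypothesis, highest weight first."""
    return sorted(((w, n1, n2) for (n1, n2), w in edges.items()), reverse=True)

# Hypothetical toy platform: dataset name -> column names.
datasets = {
    "crime_2010": ["date", "district", "offense"],
    "crime_2011": ["date", "district", "offense"],
    "crime_north": ["date", "district", "offense", "officer"],
    "budget": ["year", "department", "amount"],
}

edges = build_graph(datasets)
ranked = rank_hypotheses(edges)
for weight, left, right in ranked:
    print(f"{weight:.2f}  {left} <-> {right}")
```

In this toy setup, the pair with identical schemas ranks first as the strongest candidate (a possible duplicate or partition), while the unrelated budget dataset has no edges. A ranked list of this kind is the sort of input that could be turned into specific questions for a crowd.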