Identifying and weighting integration hypotheses on open data platforms

  • Authors:
  • Julian Eberius;Katrin Braunschweig;Maik Thiele;Wolfgang Lehner

  • Affiliations:
  • Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany

  • Venue:
  • Proceedings of the First International Workshop on Open Data
  • Year:
  • 2012


Abstract

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information, publicly available to anyone. This data has the potential to drive innovation and lead to a more democratic and transparent society. Still, the platforms it is offered on have some unique problems: their free-for-all nature, the lack of publishing standards, and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems, such as duplicated or partitioned datasets. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with data integration problems. However, these methods still require input in the form of specific questions or tasks that can be passed to the crowd. This paper identifies several classes of integration problems on open data platforms and proposes a method for identifying and ranking them in this context. In this method, an open data platform is modeled as a graph of datasets, so that potential integration problems, called integration hypotheses, can be identified by analyzing the graph for specific patterns. The paper concludes with a comprehensive evaluation using one of the largest open data platforms, opendata.socrata.com.
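
The graph-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual graph model, patterns, or weighting scheme, so everything here is an assumption made for illustration: datasets are nodes described only by their column names, an edge weight is the Jaccard similarity of two schemas, and any pair above a hypothetical threshold is reported as a candidate integration hypothesis (for example, a possibly duplicated or partitioned dataset). The function names, the threshold, and the sample data are all invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two column-name collections (assumed edge weight)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_hypotheses(datasets, threshold=0.5):
    """Return candidate dataset pairs, ordered by descending schema similarity.

    `datasets` maps a dataset name to its list of column names.
    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    hypotheses = []
    # Compare every unordered pair of datasets (the edges of the graph).
    for (name_a, cols_a), (name_b, cols_b) in combinations(sorted(datasets.items()), 2):
        weight = jaccard(cols_a, cols_b)
        if weight >= threshold:
            hypotheses.append((name_a, name_b, weight))
    # The strongest candidates come first.
    hypotheses.sort(key=lambda h: h[2], reverse=True)
    return hypotheses


# Invented sample data: two yearly partitions of one table, plus an unrelated one.
datasets = {
    "crime_2010": ["date", "district", "offense"],
    "crime_2011": ["date", "district", "offense"],
    "parks": ["name", "district", "area"],
}
ranked = rank_hypotheses(datasets)
```

With this sample data, only the two crime datasets share enough of their schema to be reported, which is the kind of pair that could then be passed to the crowd as a concrete question.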