Governments increasingly publish their data so that organizations and citizens can browse and analyze it. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration, and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework, which are designed to overcome this heterogeneity. Using declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at the technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities across different data sets achieves a precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.
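As a reading aid for the precision and recall figures reported above, the following minimal sketch shows how these two measures are commonly computed for entity linking. It is not the evaluation code of the article: the function name `precision_recall`, the record identifiers, and the toy link sets are all hypothetical, and it assumes that a link is an unordered pair of record identifiers compared against a manually verified gold standard.

```python
def precision_recall(predicted_links, gold_links):
    """Return (precision, recall) of predicted links against gold links.

    Each link is a pair of record identifiers. Pairs are treated as
    unordered, so ("a", "b") and ("b", "a") count as the same link.
    """
    predicted = {frozenset(pair) for pair in predicted_links}
    gold = {frozenset(pair) for pair in gold_links}

    # Links that were both predicted and confirmed by the gold standard.
    true_positives = len(predicted & gold)

    # Precision: share of predicted links that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold links that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: identifiers from two illustrative data sets.
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
gold = [("a1", "b1"), ("b2", "a2"), ("a4", "b4")]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the three predicted links are correct and two of the three gold links are found, so both measures come out at two thirds. High precision means few false links are published, while high recall means few true relationships are missed.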