Automatic data fusion with HumMer

Authors:
Alexander Bilke;Jens Bleiholder;Felix Naumann;Christoph Böhm;Karsten Draba;Melanie Weis
Affiliations:
Technische Universität Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany
Venue:
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Year:
2005

Abstract

Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata, it represents identical real-world objects multiple times, causing duplicates, and it has missing values and conflicting values. The Humboldt Merger (HumMer) is a tool that allows ad-hoc, declarative fusion of such data using a simple extension to SQL. Guided by a query against multiple tables, HumMer proceeds in three fully automated steps: First, instance-based schema matching bridges schematic heterogeneity of the tables by aligning corresponding attributes. Next, duplicate detection techniques find multiple representations of identical real-world objects. Finally, data fusion and conflict resolution merge duplicates into a single, consistent, and clean representation.
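
The three steps can be illustrated with a small Python sketch. It is not HumMer's implementation and does not use its SQL extension. It assumes two hypothetical toy tables and swaps in deliberately simple techniques: Jaccard overlap of attribute values for instance-based schema matching, string similarity on a single attribute for duplicate detection, and per-attribute resolution functions for fusion. All table contents, function names, and thresholds are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy sources: overlapping people under different attribute names.
SOURCE_A = [
    {"name": "Alice Smith", "age": 30, "city": "Berlin"},
    {"name": "Bob Jones", "age": 27, "city": None},      # missing value
]
SOURCE_B = [
    {"fullname": "Alice Smyth", "years": 31, "town": "Berlin"},  # conflicts
    {"fullname": "Bob Jones", "years": 27, "town": "Potsdam"},
    {"fullname": "Carol White", "years": 45, "town": "Hamburg"},
]


def match_schemas(left, right):
    """Step 1 (simplified): map each right attribute to the left attribute
    whose set of values overlaps most with it (Jaccard coefficient)."""
    def values(rows, attr):
        return {row[attr] for row in rows if row[attr] is not None}

    mapping = {}
    for r_attr in right[0]:
        def jaccard(l_attr):
            a, b = values(left, l_attr), values(right, r_attr)
            return len(a & b) / len(a | b) if a | b else 0.0
        mapping[r_attr] = max(left[0], key=jaccard)
    return mapping


def rename(rows, mapping):
    """Rewrite rows into the target schema using the attribute mapping."""
    return [{mapping[k]: v for k, v in row.items()} for row in rows]


def detect_duplicates(rows, key="name", threshold=0.85):
    """Step 2 (simplified): cluster rows whose key strings are similar."""
    cluster_of = list(range(len(rows)))  # every row starts in its own cluster
    for i, j in combinations(range(len(rows)), 2):
        a, b = rows[i][key].lower(), rows[j][key].lower()
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]
    clusters = {}
    for idx, c in enumerate(cluster_of):
        clusters.setdefault(c, []).append(rows[idx])
    return list(clusters.values())


def first_non_null(values):
    """Default conflict resolution: take the first value that is present."""
    return next((v for v in values if v is not None), None)


def max_non_null(values):
    """Alternative conflict resolution: take the largest present value."""
    return max((v for v in values if v is not None), default=None)


def fuse(clusters, resolvers, default=first_non_null):
    """Step 3 (simplified): merge each cluster into one record, resolving
    conflicts attribute by attribute."""
    fused_rows = []
    for cluster in clusters:
        record = {}
        for attr in cluster[0]:
            candidates = [row[attr] for row in cluster]
            record[attr] = resolvers.get(attr, default)(candidates)
        fused_rows.append(record)
    return fused_rows


mapping = match_schemas(SOURCE_A, SOURCE_B)
rows = SOURCE_A + rename(SOURCE_B, mapping)
fused = fuse(detect_duplicates(rows), {"age": max_non_null})
for record in fused:
    print(record)
```

In this sketch the conflicting ages for the two Alice records are resolved by taking the maximum, while the missing city for Bob is filled from the other source. Choosing a resolution function per attribute is the part that a declarative query language would let the user state directly.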