Automatic data fusion with HumMer

  • Authors:
  • Alexander Bilke;Jens Bleiholder;Felix Naumann;Christoph Böhm;Karsten Draba;Melanie Weis

  • Affiliations:
  • Technische Universität Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany

  • Venue:
  • VLDB '05 Proceedings of the 31st international conference on Very large data bases
  • Year:
  • 2005

Abstract

Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata; it represents identical real-world objects multiple times, causing duplicates; and it has missing and conflicting values. The Humboldt Merger (HumMer) is a tool that allows ad-hoc, declarative fusion of such data using a simple extension to SQL. Guided by a query against multiple tables, HumMer proceeds in three fully automated steps: First, instance-based schema matching bridges the schematic heterogeneity of the tables by aligning corresponding attributes. Next, duplicate detection techniques find multiple representations of identical real-world objects. Finally, data fusion and conflict resolution merge duplicates into a single, consistent, and clean representation.
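The three steps named in the abstract can be illustrated with a toy pipeline. The following is a minimal sketch and not HumMer's implementation: the function names, the thresholds, the sample tables, and the specific techniques (Jaccard overlap of column values for schema matching, `difflib` string similarity for duplicate detection, and a most-frequent-value rule for conflict resolution) are all assumptions chosen for brevity.

```python
from collections import Counter
from difflib import SequenceMatcher


def match_schema(left, right, threshold=0.2):
    """Step 1 (assumed technique): map each column of `right` to the column
    of `left` whose set of values overlaps most (Jaccard coefficient)."""
    mapping = {}
    for rcol in right[0]:
        rvals = {str(r[rcol]).lower() for r in right if r[rcol] is not None}
        best, best_score = None, threshold
        for lcol in left[0]:
            lvals = {str(r[lcol]).lower() for r in left if r[lcol] is not None}
            union = lvals | rvals
            score = len(lvals & rvals) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = lcol, score
        if best is not None:
            mapping[rcol] = best
    return mapping


def rename(rows, mapping):
    """Rewrite rows so that matched columns use the names of the left table."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def similarity(a, b):
    """Average string similarity over attributes that are non-null in both rows."""
    shared = [k for k in a if a[k] is not None and b.get(k) is not None]
    if not shared:
        return 0.0
    ratios = [SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
              for k in shared]
    return sum(ratios) / len(ratios)


def cluster_duplicates(rows, threshold=0.85):
    """Step 2 (assumed technique): pairwise comparison, then union-find to
    group rows that are transitively similar."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similarity(rows[i], rows[j]) >= threshold:
                parent[find(j)] = find(i)
    clusters = {}
    for i, row in enumerate(rows):
        clusters.setdefault(find(i), []).append(row)
    return list(clusters.values())


def fuse(cluster):
    """Step 3 (assumed rule): per attribute, ignore nulls and keep the most
    frequent value; ties go to the value encountered first."""
    fused = {}
    for key in cluster[0]:
        values = [row[key] for row in cluster if row.get(key) is not None]
        fused[key] = Counter(values).most_common(1)[0][0] if values else None
    return fused


# Hypothetical sample tables with different schemata, a duplicate with a
# spelling variation, and missing values.
left = [
    {"name": "John Smith", "city": "Berlin", "phone": None},
    {"name": "Mary Jones", "city": "Hamburg", "phone": "555-0101"},
]
right = [
    {"fullname": "Jon Smith", "town": "Berlin", "tel": "555-0199"},
    {"fullname": "Mary Jones", "town": "Hamburg", "tel": "555-0101"},
    {"fullname": "Peter Pan", "town": "Munich", "tel": None},
]

mapping = match_schema(left, right)
rows = left + rename(right, mapping)
fused = [fuse(cluster) for cluster in cluster_duplicates(rows)]
for record in fused:
    print(record)
```

In this sketch the two spellings of the first person end up in one cluster, and the fused record takes its phone number from the only source that provides one. A real system would need more robust similarity measures and would avoid the quadratic pairwise comparison on large tables.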