Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework

Authors:
Hamid Haidarian Shahri;Saied Haidarian Shahri
Affiliations:
University of Maryland;University of Tehran
Venue:
IEEE Intelligent Systems
Year:
2006

Citing 6

A knowledge-based approach for duplicate elimination in data cleaning
Information Systems - Data extraction, cleaning and reconciliation

Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery

Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web
EKAW '02 Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web

Declarative Data Cleaning: Language, Model, and Algorithms
Proceedings of the 27th International Conference on Very Large Data Bases

Potter's Wheel: An Interactive Data Cleaning System
Proceedings of the 27th International Conference on Very Large Data Bases

Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Cited 3

A graph-based approach to vehicle tracking in traffic camera video streams
DMSN '07 Proceedings of the 4th workshop on Data management for sensor networks: in conjunction with 33rd International Conference on Very Large Data Bases

DWEVOLVE: a requirement based framework for data warehouse evolution
ACM SIGSOFT Software Engineering Notes

A case study for integrating public safety data using semantic technologies
Information Polity - Special issue on Public Engagement and Government Collaboration: Theories, Strategies and Case Studies

Quantified Score

Hi-index	0.00

Abstract

Approximate duplicate elimination is an important data-integration task, but its complex comparisons of many records involving uncertainty and ambiguity make it difficult. Earlier approaches required a time-consuming and tedious process of hard coding static rules based on a schema. A novel duplicate-elimination framework now lets users clean data flexibly and effortlessly, without any coding. Exploiting fuzzy inference inherently handles the problem's uncertainty, and unique machine learning capabilities let the framework adapt to the specific notion of similarity appropriate for each domain. The framework is extensible and accommodative, letting the user operate with or without training data. Additionally, many of the previous methods for duplicate elimination can be implemented quickly using this framework.
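
To make the idea of fuzzy inference over record comparisons concrete, here is a minimal Python sketch. It is not the framework described in the abstract: the field names (`name`, `address`), the ramp-shaped membership function, the single rule, and the 0.5 decision threshold are all assumptions chosen for illustration, and string similarity is taken from the standard library's `difflib.SequenceMatcher` rather than from any measure the paper may use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stdlib difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def high(x: float, lo: float = 0.5, hi: float = 0.9) -> float:
    """Hypothetical membership function for the fuzzy set 'similarity is high'.

    Ramps linearly from 0 at `lo` to 1 at `hi`; both cut-offs are assumed values.
    """
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def duplicate_degree(r1: dict, r2: dict) -> float:
    """Degree in [0, 1] to which two records are considered duplicates."""
    name_high = high(similarity(r1["name"], r2["name"]))
    addr_high = high(similarity(r1["address"], r2["address"]))
    # Illustrative rule: IF name similarity is high AND address similarity
    # is high THEN the records are duplicates. Fuzzy AND is taken as min.
    return min(name_high, addr_high)


def is_duplicate(r1: dict, r2: dict, threshold: float = 0.5) -> bool:
    """Crisp decision obtained by thresholding the fuzzy degree (assumed cut-off)."""
    return duplicate_degree(r1, r2) >= threshold


if __name__ == "__main__":
    a = {"name": "John Smith", "address": "12 Main Street"}
    b = {"name": "Jon Smith", "address": "12 Main St."}
    print(duplicate_degree(a, b), is_duplicate(a, b))
```

In a sketch like this, adapting to a domain would amount to changing the membership cut-offs or the rule set, for example by fitting them to labelled record pairs when training data is available.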