The Normalized Compression Distance as a Distance Measure in Entity Identification

Authors:
Sebastian Klenk;Dennis Thom;Gunther Heidemann
Affiliations:
Intelligent Systems Group, Stuttgart University, Email: ais@vis.uni-stuttgart.de, Stuttgart, Germany 70569;Intelligent Systems Group, Stuttgart University, Email: ais@vis.uni-stuttgart.de, Stuttgart, Germany 70569;Intelligent Systems Group, Stuttgart University, Email: ais@vis.uni-stuttgart.de, Stuttgart, Germany 70569
Venue:
ICDM '09 Proceedings of the 9th Industrial Conference on Advances in Data Mining. Applications and Theoretical Aspects
Year:
2009

Abstract

The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing. This is mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the unification of different representations, which is often done manually, or the programming of tools with specialized interfaces for each representation type. However, for large and sparse databases, which are common for medical data, for example, the manual approach becomes infeasible. We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
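
To illustrate the kind of measure named in the title, the following is a minimal sketch of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input. The abstract does not state which compressor the authors use or how records are turned into strings, so the choice of zlib, the helper `record_to_bytes`, and the sample records are assumptions made only for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


def record_to_bytes(record: dict) -> bytes:
    """Hypothetical serialization: join all field values, ignoring field names.

    No shared schema is needed, which mirrors the idea of not relying on
    structural unity between sources.
    """
    return " ".join(str(value) for value in record.values()).encode("utf-8")


# Two records describing the same person in different layouts, plus an
# unrelated record (all sample data is invented for illustration).
rec_a = {"name": "Sebastian Klenk", "org": "Intelligent Systems Group",
         "place": "Stuttgart University, Germany"}
rec_b = {"person": "Klenk, Sebastian", "affiliation": "Intelligent Systems Group",
         "univ": "Stuttgart Univ.", "country": "Germany"}
rec_c = {"id": "0471-2289-XZ", "finding": "blood pressure 120/80",
         "date": "2009-03-14", "ward": "ward 7"}

a, b, c = (record_to_bytes(r) for r in (rec_a, rec_b, rec_c))
print("same entity:     ", ncd(a, b))
print("different entity:", ncd(a, c))
```

Smaller values indicate more shared content. On very short strings the fixed overhead of the compressor distorts the score, so a real system would likely need a threshold tuned on labeled pairs.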