Linear Algorithm for Data Compression via String Matching
Journal of the ACM (JACM)
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
A guided tour to approximate string matching
ACM Computing Surveys (CSUR)
Modern Information Retrieval
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Semantic matching across heterogeneous data sources
Communications of the ACM - The patent holder's dilemma: buy, sell, or troll?
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
Adaptive sorted neighborhood methods for efficient record linkage
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Towards automated record linkage
AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analystics - Volume 61
AIKED'06 Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases
A two-step classification approach to unsupervised record linkage
AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Data & Knowledge Engineering
IEEE Transactions on Information Theory
The Normalized Compression Distance Is Resistant to Noise
IEEE Transactions on Information Theory
Hi-index | 0.00 |
The identification of identical entities accross heterogeneous data sources still involves a large amount of manual processing. This is mainly due to the fact that different sources use different data representations in varying semantic contexts. Up to now entity identification requires either the --- often manual --- unification of different representations, or alternatively the effort of programming tools with specialized interfaces for each representation type. However, for large and sparse databases, which are common e.g. for medical data, the manual approach becomes infeasible. We have developed a widely applicable compression based approach that does not rely on structural or semantical unity. The results we have obtained are promising both in recognition precision and performance.