A hit-miss model for duplicate detection in the WHO drug safety database

Authors:
G. Niklas Norén;Roland Orre;Andrew Bate
Affiliations:
WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden & Stockholm University, Stockholm, Sweden;NeuroLogic Sweden AB, Stockholm, Sweden;WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Citing 4
Cited 4

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Bayesian neural networks with confidence estimations applied to data mining

Computational Statistics & Data Analysis
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Duplicate detection in adverse drug reaction surveillance

Data Mining and Knowledge Discovery
Spotting out emerging artists using geo-aware analysis of P2P query strings

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Fast approximate duplicate detection for 2D-NMR spectra

DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
On active learning of record matching packages

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Quantified Score

Hi-index	0.00

Visualization

Abstract

The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction incidents that occur after drugs are introduced on the market. As in other post-marketing drug safety data sets, the presence of duplicate records is an important data quality problem and the detection of duplicates in the WHO drug safety database remains a formidable challenge, especially since the reports are anonymised before submitted to the database. However, to our knowledge no work has been published on methods for duplicate detection in post-marketing drug safety data. In this paper, we propose a method for probabilistic duplicate detection based on the hit-miss model for statistical record linkage described by Copas & Hilton. We present two new generalisations of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the effectiveness of the hit-miss model for duplicate detection in the WHO drug safety database both at identifying the most likely duplicate for a given record (94.7% accuracy) and at discriminating duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps other applications throughout the KDD community.