Duplicate detection in adverse drug reaction surveillance

Authors:
G. Niklas Norén;Roland Orre;Andrew Bate;I. Ralph Edwards
Affiliations:
WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden and Mathematical Statistics, Stockholm University, Stockholm, Sweden;NeuroLogic Sweden AB, Stockholm, Sweden;WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden;WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden
Venue:
Data Mining and Knowledge Discovery
Year:
2007

Abstract

The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. Duplicate case reports are an important data quality problem, and detecting them remains a formidable challenge, especially in the WHO drug safety database, where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton. The model copes well with the limited amount of training data and suits the available data, which are categorical and numerical rather than free text. We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the method's effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other knowledge discovery applications as well.
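
To make the scoring idea behind a hit-miss style model concrete, the following is a minimal sketch for categorical fields only. It is an illustration under simplifying assumptions, not the parametrisation used in the paper: each field is assumed to have a single disagreement parameter c (with probability 1 - c both reports in a true duplicate pair carry the same value, otherwise the two values are independent draws from the field's value distribution), missing values are assumed to contribute no evidence, and fields are assumed independent. The names field_weight, pair_score, freqs and cs are hypothetical. The mixture model for numerical fields and the correction for correlated fields mentioned in the abstract are not covered.

```python
import math


def field_weight(x, y, freq, c):
    """Log2 likelihood ratio (duplicate vs. random pair) for one categorical field.

    freq maps each value to its relative frequency in the database.
    c is the assumed probability that the two values of a true duplicate
    pair are independent draws rather than copies of one true value.
    """
    if x is None or y is None:
        # Assumption: a missing value gives no evidence either way.
        return 0.0
    if x == y:
        beta = freq[x]
        # P(match | duplicate) = beta * (1 - c * (1 - beta)); P(match | random) = beta**2
        return math.log2(1 - c * (1 - beta)) - math.log2(beta)
    # P(j, k | duplicate) = c * beta_j * beta_k; P(j, k | random) = beta_j * beta_k
    return math.log2(c)


def pair_score(rec_a, rec_b, freqs, cs):
    """Sum the field weights over all fields, assuming independent fields."""
    return sum(
        field_weight(rec_a.get(field), rec_b.get(field), freqs[field], cs[field])
        for field in freqs
    )


# Hypothetical frequencies and parameters, for illustration only.
freqs = {
    "sex": {"F": 0.5, "M": 0.5},
    "country": {"SE": 0.01, "US": 0.5},
}
cs = {"sex": 0.05, "country": 0.02}

report_a = {"sex": "F", "country": "SE"}
report_b = {"sex": "F", "country": "SE"}
report_c = {"sex": "M", "country": "US"}

score_ab = pair_score(report_a, report_b, freqs, cs)  # agreeing pair
score_ac = pair_score(report_a, report_c, freqs, cs)  # disagreeing pair
```

In this sketch, agreement on a rare value (such as a country with few reports) adds more to the score than agreement on a common one, and each disagreement subtracts a fixed penalty of log2(c). A pair would be flagged as a suspected duplicate when its total score exceeds a threshold, which would have to be chosen on labelled data.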