A probabilistic model for approximate identity matching

Authors:
G. Alan Wang; Hsinchun Chen; Homa Atabakhsh
Affiliations:
University of Arizona, Tucson, AZ; University of Arizona, Tucson, AZ; University of Arizona, Tucson, AZ
Venue:
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Year:
2006

Abstract

Identity management is critical to government functions ranging from providing citizen services to enforcing homeland security. Searching for a specific identity is difficult because a single individual may have multiple identity representations, arising from both unintentional errors and intentional deception. We propose a probabilistic Naïve Bayes model that improves on the effectiveness of existing identity matching techniques. Experiments show that the proposed model performs significantly better than both an exact-match technique and an approximate-match record comparison algorithm. In addition, the model greatly reduces the effort of manually labeling training instances by employing a semi-supervised learning approach. This training method outperforms both fully supervised and unsupervised learning. With a training dataset in which only 10% of the instances are labeled, the model achieves performance comparable to that of fully supervised learning.
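
To make the idea of semi-supervised Naïve Bayes matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each pair of identity records has already been reduced to binary agreement features (the hypothetical name, date-of-birth, and address flags below), that the classifier is a two-class Bernoulli Naïve Bayes, and that the unlabeled pairs are incorporated with an EM loop. The abstract specifies none of these details, and all function names and data are illustrative.

```python
import math


def m_step(xs, resp, alpha=1.0):
    """Re-estimate parameters from (possibly soft) match responsibilities.

    xs   : list of binary feature tuples (1 = the two records agree).
    resp : list of P(match) per pair; 0.0 or 1.0 for labeled pairs.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    n_feat = len(xs[0])
    w_match = sum(resp)
    weight = [len(xs) - w_match, w_match]  # soft counts: non-match, match
    prior = (w_match + alpha) / (len(xs) + 2 * alpha)
    theta = [[0.0] * n_feat for _ in range(2)]
    for c in (0, 1):
        for j in range(n_feat):
            agree = sum((r if c == 1 else 1.0 - r) * x[j]
                        for x, r in zip(xs, resp))
            # theta[c][j] = P(feature j agrees | class c)
            theta[c][j] = (agree + alpha) / (weight[c] + 2 * alpha)
    return prior, theta


def match_probability(x, prior, theta):
    """Posterior P(match | x) under the Naive Bayes independence assumption."""
    log_p = [math.log(1.0 - prior), math.log(prior)]
    for c in (0, 1):
        for j, v in enumerate(x):
            p = theta[c][j]
            log_p[c] += math.log(p if v else 1.0 - p)
    top = max(log_p)  # subtract the max for numerical stability
    e0, e1 = math.exp(log_p[0] - top), math.exp(log_p[1] - top)
    return e1 / (e0 + e1)


def fit(labeled, unlabeled, n_iter=20):
    """Initialize from labeled pairs, then refine with EM over all pairs."""
    xs_l = [x for x, _ in labeled]
    resp_l = [float(y) for _, y in labeled]
    prior, theta = m_step(xs_l, resp_l)
    for _ in range(n_iter):
        # E-step: soft labels for the unlabeled pairs only.
        resp_u = [match_probability(x, prior, theta) for x in unlabeled]
        # M-step: labeled pairs keep their hard labels.
        prior, theta = m_step(xs_l + list(unlabeled), resp_l + resp_u)
    return prior, theta


# Hypothetical features: (name agrees, date of birth agrees, address agrees).
labeled = [((1, 1, 1), 1), ((1, 1, 0), 1), ((0, 0, 0), 0), ((0, 0, 1), 0)]
unlabeled = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]

prior, theta = fit(labeled, unlabeled)
p_same = match_probability((1, 1, 1), prior, theta)
p_diff = match_probability((0, 0, 0), prior, theta)
```

In this toy setup, `p_same` should exceed `p_diff`, since full agreement is more typical of the labeled matches. A real system would likely replace the binary flags with graded similarity measures and would need a decision threshold chosen on held-out data.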