Author name disambiguation using a new categorical distribution similarity

Authors:
Shaohua Li;Gao Cong;Chunyan Miao
Affiliations:
Nanyang Technological University, Singapore;Nanyang Technological University, Singapore;Nanyang Technological University, Singapore
Venue:
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Year:
2012

Citing 8
Cited 0

Two supervised learning approaches for name disambiguation in author citations

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
ArnetMiner: extraction and mining of academic social networks

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Author name disambiguation in MEDLINE

ACM Transactions on Knowledge Discovery from Data (TKDD)
Using web information for author name disambiguation

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations

Journal of the American Society for Information Science and Technology
ADANA: Active Name Disambiguation

ICDM '11 Proceedings of the 2011 IEEE 11th International Conference on Data Mining
A Unified Probabilistic Framework for Name Disambiguation in Digital Library

IEEE Transactions on Knowledge and Data Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

Author name ambiguity has been a long-standing problem which impairs the accuracy of publication retrieval and bibliometric methods. Most of the existing disambiguation methods are built on similarity measures, e.g., "Jaccard Coefficient", between two sets of papers to be disambiguated, each set represented by a set of categorical features, e.g., coauthors and published venues. Such measures perform bad when the two sets are small, which is typical in Author Name Disambiguation. In this paper, we propose a novel categorical set similarity measure. We model an author's preference, e.g., to venues, using a categorical distribution, and derive a likelihood ratio to estimate the likelihood that the two sets are drawn from the same distribution. This likelihood ratio is used as the similarity measure to decide whether two sets belong to the same author. This measure is mathematically principled and verified to perform well even when the cardinalities of the two compared sets are small. Additionally, we propose a new method to estimate the number of distinct authors for a given name based on the name statistics extracted from a digital library. Experiment shows that our method significantly outperforms a baseline method, a widely used benchmark method, and a real system.