Author name ambiguity is a long-standing problem that impairs the accuracy of publication retrieval and bibliometric methods. Most existing disambiguation methods are built on similarity measures, such as the Jaccard coefficient, between two sets of papers to be disambiguated, where each set is represented by categorical features, e.g., coauthors and publication venues. Such measures perform poorly when the two sets are small, which is typical in author name disambiguation. In this paper, we propose a novel categorical set similarity measure. We model an author's preference, e.g., for venues, using a categorical distribution, and derive a likelihood ratio to estimate the likelihood that the two sets are drawn from the same distribution. This likelihood ratio is used as the similarity measure to decide whether two sets belong to the same author. The measure is mathematically principled and is verified to perform well even when the cardinalities of the two compared sets are small. Additionally, we propose a new method to estimate the number of distinct authors for a given name based on name statistics extracted from a digital library. Experiments show that our method significantly outperforms a baseline method, a widely used benchmark method, and a real system.
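To make the contrast between the two kinds of measure concrete, the following is a minimal sketch, not the derivation used in the paper. It compares the Jaccard coefficient with one possible likelihood ratio: the marginal likelihood of both venue lists under a single shared categorical distribution, divided by the product of their marginal likelihoods under two independent distributions. The symmetric Dirichlet prior, the parameter names `vocab_size` and `alpha`, and the function names are all assumptions introduced here for illustration.

```python
from collections import Counter
from math import lgamma


def jaccard(a, b):
    """Jaccard coefficient between two collections of categorical values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def log_marginal(counts, vocab_size, alpha=1.0):
    """Log marginal likelihood of the observations summarized by `counts`.

    Assumes (hypothetically) a categorical distribution over `vocab_size`
    categories with a symmetric Dirichlet(alpha) prior, integrated out.
    """
    n = sum(counts.values())
    result = lgamma(vocab_size * alpha) - lgamma(vocab_size * alpha + n)
    # Categories with zero count contribute nothing, so only observed ones are summed.
    for c in counts.values():
        result += lgamma(alpha + c) - lgamma(alpha)
    return result


def log_likelihood_ratio(a, b, vocab_size, alpha=1.0):
    """Log of P(a, b | one shared distribution) / (P(a) * P(b)).

    Positive values favor "same author"; negative values favor "different authors".
    """
    ca, cb = Counter(a), Counter(b)
    same = log_marginal(ca + cb, vocab_size, alpha)
    different = log_marginal(ca, vocab_size, alpha) + log_marginal(cb, vocab_size, alpha)
    return same - different


# Two small venue sets that overlap, and one that does not (made-up data).
set_a = ["KDD", "ICDM"]
set_b = ["KDD", "ICDM"]
set_c = ["SIGMOD", "VLDB"]

print(jaccard(set_a, set_b), log_likelihood_ratio(set_a, set_b, vocab_size=100))
print(jaccard(set_a, set_c), log_likelihood_ratio(set_a, set_c, vocab_size=100))
```

In this sketch the ratio depends on how surprising the overlap is given the number of possible venues, so a match on two venues out of many yields a positive score even for very small sets, whereas the Jaccard coefficient only reports the fraction of shared items.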