Author name disambiguation in MEDLINE

Authors:
Vetle I. Torvik;Neil R. Smalheiser
Affiliations:
University of Illinois at Chicago, Chicago, IL, USA;University of Illinois at Chicago, Chicago, IL, USA
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2009


Abstract

Background: We recently described “Author-ity,” a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE.
Methods: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals.
Results: Pairwise comparisons were computed for all author names that share a last name and first initial, across all 15.3 million articles in MEDLINE (2006 baseline), to create Author-ity 2006, a database in which each name on each article is assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ∼98.8%. Lumping (putting two different individuals into the same cluster) affects ∼0.5% of clusters, whereas splitting (assigning articles written by the same individual to more than one cluster) affects ∼2% of articles.
Impact: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior.
Availability: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
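
Illustrative sketch (not the authors' implementation): the abstract mentions pairwise same-author probabilities, transitivity violations, and likelihood-based agglomerative clustering, but gives no formulas. The Python below is a minimal toy under stated assumptions. It assumes a triplet is flagged when p_ij + p_jk - 1 > p_ik for some ordering of the three pairs, and it assumes a greedy merge rule in which two clusters are joined while the summed log-odds of their cross pairs is positive. It only detects violations and does not perform the weighted least squares correction. The function names and the toy probabilities are hypothetical.

```python
import math
from itertools import combinations


def pair(prob, i, j):
    """Look up the same-author probability for an unordered article pair."""
    return prob[frozenset((i, j))]


def transitivity_violations(n, prob, tol=1e-12):
    """List triplets whose pairwise probabilities are mutually inconsistent.

    Assumed criterion: a + b - 1 > c for some ordering of the three pairs.
    """
    flagged = []
    for i, j, k in combinations(range(n), 3):
        pij, pjk, pik = pair(prob, i, j), pair(prob, j, k), pair(prob, i, k)
        orderings = [(pij, pjk, pik), (pij, pik, pjk), (pik, pjk, pij)]
        if any(a + b - 1 > c + tol for a, b, c in orderings):
            flagged.append((i, j, k))
    return flagged


def log_odds(p, eps=1e-9):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def agglomerate(n, prob):
    """Greedy merging: join the cluster pair with the largest positive gain.

    The gain of a merge is the sum of log-odds over all cross-cluster pairs,
    i.e. the change in log-likelihood if within-cluster pairs contribute p
    and between-cluster pairs contribute (1 - p). This is an assumed model.
    """
    clusters = [{i} for i in range(n)]
    while len(clusters) > 1:
        best_gain, best = 0.0, None
        for a, b in combinations(range(len(clusters)), 2):
            gain = sum(log_odds(pair(prob, i, j))
                       for i in clusters[a] for j in clusters[b])
            if gain > best_gain:
                best_gain, best = gain, (a, b)
        if best is None:  # no merge improves the likelihood
            break
        a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: four articles sharing one author name (values are made up).
toy = {
    frozenset((0, 1)): 0.95, frozenset((1, 2)): 0.90, frozenset((0, 2)): 0.40,
    frozenset((0, 3)): 0.10, frozenset((1, 3)): 0.10, frozenset((2, 3)): 0.05,
}
print(transitivity_violations(4, toy))  # [(0, 1, 2)]
print(agglomerate(4, toy))              # [[0, 1, 2], [3]]
```

In this toy case, articles 0 and 2 look dissimilar when compared directly (0.40) yet each is strongly linked to article 1, so the triplet is flagged; the clustering step still groups all three because the combined evidence favors a merge, while article 3 stays apart.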