Citation-based bootstrapping for large-scale author disambiguation

Authors:
Michael Levin; Stefan Krawczyk; Steven Bethard; Dan Jurafsky
Affiliations:
Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305-9025; Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305-9025; Center for Computational Language and Education Research, University of Colorado Boulder, Boulder, CO 80309-0594; Linguistics Department, Stanford University, 450 Serra Mall, Stanford, CA 94305
Venue:
Journal of the American Society for Information Science and Technology
Year:
2012

Abstract

We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B³ F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1. © 2012 Wiley Periodicals, Inc.
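
The abstract reports its results as B³ (B-cubed) F1, a standard clustering metric that averages per-item precision and recall. The following is a minimal sketch of how such a score can be computed, not the authors' evaluation code; the function name `b_cubed` and the toy author-instance labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Compute B-cubed precision, recall, and F1.

    predicted, gold: dicts mapping each author instance id to a
    cluster label (hypothetical input format for this sketch).
    """
    if not gold or set(predicted) != set(gold):
        raise ValueError("predicted and gold must label the same non-empty item set")

    # Group items by cluster label in each clustering.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for item, label in predicted.items():
        pred_clusters[label].add(item)
    for item, label in gold.items():
        gold_clusters[label].add(item)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in gold:
        p = pred_clusters[predicted[item]]  # items the system grouped with this one
        g = gold_clusters[gold[item]]       # items that truly share its author
        overlap = len(p & g)                # always >= 1: the item itself
        precision_sum += overlap / len(p)
        recall_sum += overlap / len(g)

    n = len(gold)
    precision = precision_sum / n
    recall = recall_sum / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: instances a, b, c belong to one author; d, e to another.
gold = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B"}
# The system wrongly moves instance c into the second author's cluster.
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
precision, recall, f1 = b_cubed(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because each item is scored individually and then averaged, one misplaced instance in a large cluster lowers the score less than the same mistake in a small cluster, which is one reason this family of metrics is commonly used for coreference-style tasks.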