Efficient name disambiguation for large-scale databases

Authors:
Jian Huang;Seyda Ertekin;C. Lee Giles
Affiliations:
College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA;Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA;College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA
Venue:
PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
Year:
2006

Citing 12
Cited 38

The nature of statistical learning theory

The nature of statistical learning theory
OPTICS: ordering points to identify the clustering structure

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Less is More: Active Learning with Support Vector Machines

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Automatic document metadata extraction using support vector machines

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Two supervised learning approaches for name disambiguation in author citations

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
An integrated, conditional model of information extraction and coreference with application to citation matching

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Name disambiguation in author citations using a K-way spectral clustering method

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Effective and scalable solutions for mixed and split citation problems in digital libraries

Proceedings of the 2nd international workshop on Information quality in information systems
Unsupervised personal name disambiguation

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Adaptive Name Matching in Information Integration

IEEE Intelligent Systems
Fast Kernel Classifiers with Online and Active Learning

The Journal of Machine Learning Research

Efficient topic-based unsupervised name disambiguation

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Learning on the border: active learning in imbalanced data classification

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Collaboration over time: characterizing and modeling network evolution

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Using web information for creating publication venue authority files

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Keeping a digital library clean: new solutions to old problems

Proceedings of the eighth ACM symposium on Document engineering
Author Name Disambiguation for Citations Using Topic and Web Correlation

ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
On co-authorship for author disambiguation

Information Processing and Management: an International Journal
Author name disambiguation in MEDLINE

ACM Transactions on Knowledge Discovery from Data (TKDD)
Disambiguating authors in academic publications using random forests

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Using web information for author name disambiguation

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Solving the "Who's Mark Johnson" puzzle: information extraction based cross document coreference

SRWS '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
SyGAR: a synthetic data generator for evaluating name disambiguation methods

ECDL'09 Proceedings of the 13th European conference on Research and advanced technology for digital libraries
Effective self-training author name disambiguation in scholarly digital libraries

Proceedings of the 10th annual joint conference on Digital libraries
SeerSuite: developing a scalable and reliable application framework for building digital libraries by crawling the web

WebApps'10 Proceedings of the 2010 USENIX conference on Web application development
Enhancing cross document coreference of web documents with context similarity and very large scale text categorization

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Evaluating entity resolution results

Proceedings of the VLDB Endowment
A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments

Journal of the American Society for Information Science and Technology
Construction of a large-scale test set for author disambiguation

Information Processing and Management: an International Journal
On identifying academic homepages for digital libraries

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Resolving author name homonymy to improve resolution of structures in co-author networks

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Metadata enrichment via topic models for author name disambiguation

NLP4DL'09/AT4DL'09 Proceedings of the 2009 international conference on Advanced language technologies for digital libraries
Combining machine learning and human judgment in author disambiguation

Proceedings of the 20th ACM international conference on Information and knowledge management
Author disambiguation using multi-aspect similarity indicators

Scientometrics
Disambiguating authors in citations on the web and authorship correlations

Expert Systems with Applications: An International Journal
Cost-effective on-demand associative author name disambiguation

Information Processing and Management: an International Journal
A tool for generating synthetic authorship records for evaluating author name disambiguation methods

Information Sciences: an International Journal
Active associative sampling for author name disambiguation

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Citation-based bootstrapping for large-scale author disambiguation

Journal of the American Society for Information Science and Technology
AUTOMATIC ANNOTATION OF AMBIGUOUS PERSONAL NAMES ON THE WEB

Computational Intelligence
A brief survey of automatic methods for author name disambiguation

ACM SIGMOD Record
Ambiguous author query detection using crowdsourced digital library annotations

Information Processing and Management: an International Journal
Vietnamese author name disambiguation for integrating publications from heterogeneous sources

ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part I
An automatic system for identifying authorities in digital libraries

Expert Systems with Applications: An International Journal
A relevance feedback approach for the author name disambiguation problem

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
de-linkability: a privacy-preserving constraint for safely outsourcing multimedia documents

Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems
Author name disambiguation in scientific collaboration and mobility cases

Scientometrics
Efficient supervised and semi-supervised approaches for affiliations disambiguation

Scientometrics
Robust hybrid name disambiguation framework for large databases

Scientometrics

Quantified Score

Hi-index	0.00

Visualization

Abstract

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.