A tool for generating synthetic authorship records for evaluating author name disambiguation methods

Authors:
Anderson A. Ferreira;Marcos André Gonçalves;Jussara M. Almeida;Alberto H. F. Laender;Adriano Veloso
Affiliations:
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil and Departamento de Computação, Universidade Federal de Ouro Preto, Brazil;Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil;Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil;Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil;Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil
Venue:
Information Sciences: an International Journal
Year:
2012


Abstract

Author name disambiguation must cope with uncertainty arising from the possible many-to-many correspondences between ambiguous names and unique authors. Although the literature offers a variety of name disambiguation methods, these methods are rarely compared against each other. Moreover, they are often evaluated without considering a time-evolving digital library, which is subject to dynamic (and therefore challenging) patterns such as the introduction of new authors and changes in researchers' interests over time. To facilitate the evaluation of name disambiguation methods in various realistic scenarios and under controlled conditions, in this article we propose SyGAR, a new Synthetic Generator of Authorship Records that generates citation records based on author profiles. SyGAR can generate successive loads of citation records, simulating a living digital library that evolves according to various publication patterns. We validate SyGAR by comparing the results produced by three representative name disambiguation methods on real and synthetically generated collections of citation records. We also demonstrate its applicability by evaluating those methods on a time-evolving digital library collection generated with the tool, considering several dynamic and realistic scenarios.