Cost-effective on-demand associative author name disambiguation

  • Authors:
  • Adriano Veloso; Anderson A. Ferreira; Marcos André Gonçalves; Alberto H. F. Laender; Wagner Meira, Jr.

  • Affiliations:
  • Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil (all authors)

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2012

Abstract

Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges may prevent such techniques from reaching their full potential: the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only a few citations). In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) with specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), which extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on the examples most suitable for the task; and (3) SLAND (Self-Training LAND), which extends LAND with self-training capabilities, drastically reducing the number of examples required for building effective disambiguation functions, while also detecting novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND outperforms state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, making it extremely effective and practical.
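To make the abstract's core idea concrete, the sketch below shows one simple way rules associating citation features with authors could be mined and applied lazily (demand-driven, restricted to training examples sharing features with the test citation). This is an illustrative toy in Python, not the authors' implementation: the data, the single-feature rules, the `min_conf` threshold, and the confidence-weighted voting are all simplifying assumptions.

```python
from collections import Counter, defaultdict

# Toy labeled citations: each pairs a set of citation features
# (coauthor names, venue/title terms) with an author id.
# Purely illustrative data, not from the paper.
training = [
    ({"j_smith", "sigir", "retrieval"}, "A1"),
    ({"j_smith", "sigir", "ranking"}, "A1"),
    ({"k_lee", "vldb", "databases"}, "A2"),
    ({"k_lee", "vldb", "indexing"}, "A2"),
]

def extract_rules(examples, min_conf=0.6):
    """Mine single-feature association rules feature -> author whose
    confidence, support(feature, author) / support(feature), is >= min_conf."""
    feat_count = Counter()
    feat_author = Counter()
    for feats, author in examples:
        for f in feats:
            feat_count[f] += 1
            feat_author[(f, author)] += 1
    rules = {}
    for (f, author), n in feat_author.items():
        conf = n / feat_count[f]
        if conf >= min_conf:
            rules.setdefault(f, []).append((author, conf))
    return rules

def disambiguate_lazy(citation_feats, examples, min_conf=0.6):
    """Lazy (demand-driven) variant: keep only training examples that
    share at least one feature with the test citation, mine rules from
    that projection, then pick the author with the highest
    confidence-weighted vote. Returns None if no rule fires."""
    relevant = [(fe, a) for fe, a in examples if fe & citation_feats]
    rules = extract_rules(relevant, min_conf)
    votes = defaultdict(float)
    for f in citation_feats:
        for author, conf in rules.get(f, []):
            votes[author] += conf
    return max(votes, key=votes.get) if votes else None

print(disambiguate_lazy({"j_smith", "retrieval"}, training))  # prints A1
```

Filtering the training set per test citation is what shrinks the hypothesis space in the LAND-style lazy setting; a self-training extension along the lines of SLAND would additionally feed back its most confident predictions as new labeled examples.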