Citation data clustering for author name disambiguation

Authors:
Tomonari Masada;Atsuhiro Takasu;Jun Adachi
Affiliations:
Nagasaki University, Nagasaki, Japan;National Institute of Informatics, Chiyoda-ku, Tokyo, Japan;National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
Venue:
Proceedings of the 2nd international conference on Scalable information systems
Year:
2007

Citing (6):
A deterministic annealing approach to clustering. Pattern Recognition Letters.
Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning - Special issue on information retrieval.
Two supervised learning approaches for name disambiguation in author citations. Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries.
Name disambiguation in author citations using a K-way spectral clustering method. Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries.
Reference reconciliation in complex information spaces. Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
A hierarchical naive Bayes mixture model for name disambiguation in author citations. Proceedings of the 2005 ACM symposium on Applied computing.

Cited by (1):
Author name disambiguation for citations on the deep web. WAIM'10 Proceedings of the 2010 international conference on Web-age information management.

Abstract

We propose a new method of citation data clustering for author name disambiguation. Most citation data in the reference sections of scientific papers abbreviate coauthors' first names to their initials. Searches over citation data therefore often use an abbreviated name, e.g. "S. Lee" or "J. Chen", and return many irrelevant records, because one abbreviated name can refer to many different people. Our method builds clusters such that each cluster contains only the citation data of a single author. The clustering is based on a probabilistic model that extends the naive Bayes mixture model. Because the model has two hidden variables, we call it the two-variable mixture model. In the evaluation experiment, we used the well-known DBLP data set. The results show that the two-variable mixture model can achieve a better balance between precision and recall than the naive Bayes mixture model.
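
For orientation, the baseline that the abstract compares against, a naive Bayes mixture model fitted with EM, might be sketched roughly as follows. This is not the authors' two-variable model, whose details the abstract does not give. It is a minimal illustration that assumes each citation is a bag of tokens (coauthor names and title words). The function name `fit_nb_mixture`, the smoothing value `alpha`, and the toy citations are all hypothetical.

```python
import math
import random
from collections import Counter


def fit_nb_mixture(docs, k, iters=50, alpha=0.1, seed=0):
    """EM for a multinomial naive Bayes mixture; returns one hard cluster label per document."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]

    # Random soft initialisation: resp[i][c] approximates P(cluster c | document i).
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iters):
        # M-step: mixing weights and smoothed per-cluster word distributions.
        prior = [sum(r[c] for r in resp) / len(docs) for c in range(k)]
        word_logp = []
        for c in range(k):
            mass = {w: alpha for w in vocab}
            for r, cnt in zip(resp, counts):
                for w, n in cnt.items():
                    mass[w] += r[c] * n
            z = sum(mass.values())
            word_logp.append({w: math.log(v / z) for w, v in mass.items()})

        # E-step: recompute responsibilities in log space for numerical stability.
        for i, cnt in enumerate(counts):
            logs = [
                math.log(prior[c] + 1e-12)
                + sum(n * word_logp[c][w] for w, n in cnt.items())
                for c in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp[i] = [e / total for e in exps]

    # Assign each document to its most probable cluster.
    return [max(range(k), key=lambda c: r[c]) for r in resp]


# Toy citations that share the abbreviated name "S. Lee" (invented for illustration).
citations = [
    ["s_lee", "j_kim", "database", "query", "index"],
    ["s_lee", "j_kim", "query", "optimization"],
    ["s_lee", "m_garcia", "protein", "folding"],
    ["s_lee", "m_garcia", "protein", "structure"],
]
labels = fit_nb_mixture(citations, k=2)
```

Each label is a candidate author identity for one citation. In practice the number of clusters `k` is unknown and EM is sensitive to initialisation, so a real system would need model selection and multiple restarts.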