Efficient supervised and semi-supervised approaches for affiliations disambiguation

Authors:
Pascal Cuxac; Jean-Charles Lamirel; Valerie Bonvallot
Affiliations:
INIST-CNRS, Vandoeuvre les Nancy, France; LORIA-Synalp, Vandoeuvre les Nancy, France; INIST-CNRS, Vandoeuvre les Nancy, France
Venue:
Scientometrics
Year:
2013

Abstract

The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, semantic web, etc. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, omissions, and so on. The search for names of persons or organizations is therefore difficult as soon as a single name might appear in many different forms. This paper proposes two approaches to disambiguate the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that there is no learning resource and uses a semi-supervised approach that mixes soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation of our approach: it cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
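As an illustration of the supervised setting described in the abstract, the following is a minimal sketch of a multinomial Naive Bayes classifier that maps raw affiliation strings to canonical labels. It is not the authors' implementation: the abstract does not specify the features, smoothing, or label set, so the word-token features, the add-one smoothing, the class name `NaiveBayesAffiliations`, the toy training strings, and the labels `INIST` and `LORIA` are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(affiliation):
    """Lowercase an affiliation string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", affiliation.lower())


class NaiveBayesAffiliations:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    def fit(self, texts, labels):
        # Count training documents per class and token occurrences per class.
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: invented affiliation variants with hypothetical canonical labels.
train_texts = [
    "INIST-CNRS, Vandoeuvre les Nancy, France",
    "Inst. Information Scientifique et Technique, CNRS, Vandoeuvre",
    "LORIA-Synalp, Vandoeuvre les Nancy, France",
    "LORIA, Campus Scientifique, Nancy",
]
train_labels = ["INIST", "INIST", "LORIA", "LORIA"]

model = NaiveBayesAffiliations().fit(train_texts, train_labels)
print(model.predict("CNRS INIST, Vandoeuvre"))
print(model.predict("Lab. LORIA, Nancy"))
```

With a balanced toy set like this one, the class priors are equal and the token likelihoods decide the label. When one class dominates the training data, its prior and denser token counts can swamp the others, which is one plausible reading of the sensitivity to highly unbalanced data that the abstract reports.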