A semi-supervised approach for author disambiguation in KDD CUP 2013

Authors:
Jianyu Zhao;Peng Wang;Kai Huang
Affiliations:
Southeast University, Nanjing, P.R. China;Southeast University, Nanjing, P.R. China;Southeast University, Nanjing, P.R. China
Venue:
Proceedings of the 2013 KDD Cup 2013 Workshop
Year:
2013

Abstract

Name disambiguation aims to identify different names that refer to the same person and identical names that refer to different persons. It is one of the most important basic problems in areas such as natural language processing, information retrieval, and digital libraries. The Microsoft Academic Search data in Track 2 of KDD Cup 2013 poses this challenge to researchers in the knowledge discovery and data mining community. Besides being real-world and large-scale, the Track 2 task raises several challenges: (1) both the synonym and polysemy problems must be considered; (2) the data contain a large amount of noise and many missing attributes; (3) no labeled data are available, which makes the task a cold-start problem. In this paper, we describe our solution to Track 2 of KDD Cup 2013, whose task is author disambiguation: deciding from academic publication data whether author records refer to the same person. We propose a multi-phase semi-supervised approach. First, we preprocess the dataset and generate features for the models, then construct a coauthor-based network and apply community detection to perform the first-phase disambiguation, which handles the cold-start problem. Second, using the first-phase results, we apply a support vector machine and various other models to exploit the noisy data with missing attributes. Further, we propose a self-taught procedure to resolve ambiguity in coauthor information, which boosts the performance of the results from the other models. Finally, by blending the results of the different models, we achieve 6th place on the public leaderboard with a mean F-score of 0.98717 and 7th place on the private leaderboard with a mean F-score of 0.98651.
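
As a rough illustration of the first phase described above, the following minimal Python sketch groups author records through a coauthor-based network. It is not the authors' implementation: the abstract does not name the community detection algorithm, so connected components (via union-find) are used here as a simple stand-in. The `normalize` rule (first initial plus last name), the function `disambiguate`, and the input layout are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def normalize(name):
    """Hypothetical name key: first initial plus last name, lowercased."""
    parts = name.lower().replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}" if parts else ""


def disambiguate(author_names, paper_authors):
    """Group author ids that may refer to one person.

    author_names:  dict of author id -> name string
    paper_authors: dict of paper id -> list of author ids
    """
    # Collect, for every author id, the name keys of its coauthors.
    coauthors = defaultdict(set)
    for ids in paper_authors.values():
        for a in ids:
            coauthors[a].update(normalize(author_names[b]) for b in ids if b != a)

    # Union-find over author ids (stand-in for community detection).
    parent = {a: a for a in author_names}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Candidate pairs share a name key; link them if they share a coauthor.
    blocks = defaultdict(list)
    for a, name in author_names.items():
        blocks[normalize(name)].append(a)
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if coauthors[a] & coauthors[b]:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for a in author_names:
        groups[find(a)].add(a)
    return sorted(sorted(g) for g in groups.values())


names = {1: "Jian Zhao", 2: "J. Zhao", 3: "Jing Zhao",
         4: "Peng Wang", 5: "P. Wang", 6: "Kai Huang"}
papers = {10: [1, 4], 11: [2, 5], 12: [3, 6]}
print(disambiguate(names, papers))  # [[1, 2], [3], [4, 5], [6]]
```

In this toy input, records 1 and 2 are merged because they share a name key and a coauthor key, while record 3 has the same name key but no common coauthor and stays separate. Because no labels are needed, a grouping of this kind could provide the pseudo-labels that later supervised models rely on in a cold-start setting.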