A semi-supervised approach for author disambiguation in KDD CUP 2013

  • Authors:
  • Jianyu Zhao;Peng Wang;Kai Huang

  • Affiliations:
  • Southeast University, Nanjing, P.R. China;Southeast University, Nanjing, P.R. China;Southeast University, Nanjing, P.R. China

  • Venue:
  • Proceedings of the 2013 KDD Cup 2013 Workshop
  • Year:
  • 2013

Abstract

Name disambiguation, which aims to identify multiple names that correspond to one person and identical names that refer to different persons, is one of the most important basic problems in many areas such as natural language processing, information retrieval, and digital libraries. The Microsoft Academic Search data in the KDD Cup 2013 Track 2 task brings one such challenge to researchers in the knowledge discovery and data mining community. Besides its real-world, large-scale character, the Track 2 task raises several challenges: (1) consideration of both synonym and polysemy problems; (2) existence of a huge amount of noisy data with missing attributes; (3) absence of labeled data, which makes this challenge a cold-start problem. In this paper, we describe our solution to Track 2 of KDD Cup 2013. The challenge of this track is author disambiguation, which aims at identifying whether authors are the same person by using academic publication data. We propose a multi-phase semi-supervised approach to deal with the challenge. First, we preprocess the dataset and generate features for the models, then construct a coauthor-based network and employ community detection to accomplish the first-phase disambiguation task, which handles the cold-start problem. Second, using the results of the first phase, we use a support vector machine and various other models to exploit the noisy data with missing attributes in the dataset. Further, we propose a self-taught procedure to resolve ambiguity in coauthor information, boosting the performance of results from other models. Finally, by blending results from different models, we achieve 6th place with a mean F-score of 0.98717 on the public leaderboard and 7th place with a mean F-score of 0.98651 on the private leaderboard.
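
The first-phase idea in the abstract (a coauthor-based network used to group author records without labeled data) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the community detection algorithm, so the sketch substitutes plain connected components via union-find, and the blocking key (first initial plus last name), the `min_shared` threshold, and the function names `normalize` and `disambiguate` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def normalize(name):
    """Assumed blocking key: first initial plus last name ('J. Zhao' -> 'j zhao')."""
    parts = name.lower().replace(".", " ").split()
    if not parts:
        return ""
    return f"{parts[0][0]} {parts[-1]}"


def disambiguate(author_names, author_papers, min_shared=1):
    """Group author ids that plausibly denote the same person.

    author_names:  dict of author id -> name string
    author_papers: dict of author id -> set of paper ids
    Connected components stand in for community detection here (an assumption).
    """
    # Invert the author -> papers mapping to find who appears on each paper.
    paper_authors = defaultdict(set)
    for aid, papers in author_papers.items():
        for paper in papers:
            paper_authors[paper].add(aid)

    # Describe each author id by the normalized names of its coauthors.
    coauthors = defaultdict(set)
    for aid, papers in author_papers.items():
        for paper in papers:
            for other in paper_authors[paper]:
                if other != aid:
                    coauthors[aid].add(normalize(author_names[other]))

    # Only ids with the same blocking key are compared.
    blocks = defaultdict(list)
    for aid, name in author_names.items():
        blocks[normalize(name)].append(aid)

    # Union-find over author ids.
    parent = {aid: aid for aid in author_names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link two candidate ids when they share enough coauthor names.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if len(coauthors[a] & coauthors[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for aid in author_names:
        groups[find(aid)].add(aid)
    return list(groups.values())
```

In this sketch, two records named "Jianyu Zhao" and "J. Zhao" are merged only if they share at least one coauthor name, while a "Jun Zhao" with unrelated coauthors stays separate. Because coauthor names are themselves ambiguous, such a rule can over-merge; that ambiguity in coauthor information is what the abstract's self-taught procedure is meant to address.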