We present the ideas and methodologies we used to address the KDD Cup 2013 challenge on author-paper identification. We first formulate the problem as a personalized ranking task and then propose to solve it through a supervised learning framework. The key point is to eliminate the papers incorrectly assigned to a given author, based on existing records. We choose gradient boosted trees as our main classifier. Through our exploration we conclude that the most critical factor in achieving our results is effective feature engineering. In this paper, we formulate this process as a unified framework that constructs features from contextual information and combines machine learning techniques with human intelligence. In addition, we suggest several strategies for parsing authors' names, which improve the prediction results significantly. Divide-and-conquer model building and model averaging also improve prediction precision.
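
As a rough illustration of how model averaging and personalized ranking can fit together, the sketch below averages per-model confidence scores for each (author, paper) pair and then orders each author's candidate papers by the averaged score. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the function name `rank_papers`, the dictionary layout, the tie-breaking rule, and all score values are hypothetical, and the scores stand in for the outputs of trained classifiers such as gradient boosted trees.

```python
from collections import defaultdict
from statistics import mean


def rank_papers(model_scores):
    """Average per-model scores and rank each author's papers.

    model_scores: list of dicts, one per model, each mapping
    (author_id, paper_id) -> estimated confidence that the paper
    is correctly assigned to the author. All models are assumed
    to score the same set of pairs.
    """
    # Model averaging: take the mean score across models for each pair.
    pairs = model_scores[0].keys()
    averaged = {pair: mean(m[pair] for m in model_scores) for pair in pairs}

    # Personalized ranking: group candidates by author, then sort each
    # author's papers independently, highest averaged score first.
    by_author = defaultdict(list)
    for (author_id, paper_id), score in averaged.items():
        by_author[author_id].append((score, paper_id))

    return {
        author: [
            paper_id
            # Ties are broken by paper id (an arbitrary, assumed rule).
            for _, paper_id in sorted(cands, key=lambda c: (-c[0], c[1]))
        ]
        for author, cands in by_author.items()
    }


# Hypothetical scores from two models (illustrative values only).
model_a = {(1, 10): 0.9, (1, 11): 0.2, (1, 12): 0.6, (2, 10): 0.3, (2, 13): 0.8}
model_b = {(1, 10): 0.7, (1, 11): 0.4, (1, 12): 0.8, (2, 10): 0.1, (2, 13): 0.6}

ranking = rank_papers([model_a, model_b])
print(ranking)
```

Papers that sink to the bottom of an author's list are the candidates for incorrect assignment. Ranking within each author, rather than across all pairs at once, reflects the personalized formulation described above.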