Contextual rule-based feature engineering for author-paper identification

Authors:
Erheng Zhong (Hong Kong University of Science and Technology, Hong Kong)
Lianghao Li (Hong Kong University of Science and Technology, Hong Kong)
Naiyan Wang (Hong Kong University of Science and Technology, Hong Kong)
Ben Tan (Hong Kong University of Science and Technology, Hong Kong)
Yin Zhu (Hong Kong University of Science and Technology, Hong Kong)
Lili Zhao (Hong Kong University of Science and Technology, Hong Kong)
Qiang Yang (Hong Kong University of Science and Technology, Hong Kong and Huawei Noah's Ark Lab, Hong Kong)
Venue:
Proceedings of the 2013 KDD Cup 2013 Workshop
Year:
2013

Abstract

We present the ideas and methodologies that we used to address the KDD Cup 2013 challenge on author-paper identification. We first formulate the problem as a personalized ranking task and then solve it within a supervised learning framework. The key point is to eliminate, based on existing records, the papers incorrectly assigned to a given author. We choose Gradient Boosted Tree as our main classifier. Through our exploration, we conclude that the most critical factor in achieving our results is effective feature engineering. In this paper, we formulate this process as a unified framework that constructs features from contextual information and combines machine learning techniques with human intelligence. In addition, we suggest several strategies for parsing authors' names, which improve the prediction results significantly. Divide-and-conquer model building and model averaging also improve prediction precision.
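
The ranking and model-averaging ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring functions `model_a` and `model_b` are hypothetical stand-ins for trained Gradient Boosted Tree classifiers, and the feature names `name_match` and `coauthor_overlap` are illustrative assumptions, not features named in the paper.

```python
from collections import defaultdict
from statistics import mean


def rank_papers_per_author(candidates, models):
    """Rank each author's candidate papers by the averaged model score.

    candidates: iterable of (author_id, paper_id, feature_dict) tuples.
    models: list of callables mapping a feature dict to a confidence score.
    """
    scored = defaultdict(list)
    for author_id, paper_id, features in candidates:
        # Model averaging: take the mean of the individual model scores.
        score = mean(model(features) for model in models)
        scored[author_id].append((score, paper_id))
    # Personalized ranking: sort within each author, highest score first.
    return {
        author: [paper for _, paper in sorted(pairs, key=lambda sp: -sp[0])]
        for author, pairs in scored.items()
    }


# Hypothetical scorers standing in for trained classifiers (assumption).
def model_a(f):
    return 0.7 * f["name_match"] + 0.3 * f["coauthor_overlap"]


def model_b(f):
    return 0.4 * f["name_match"] + 0.6 * f["coauthor_overlap"]


# Toy author-paper pairs with made-up feature values.
candidates = [
    ("a1", "p1", {"name_match": 1.0, "coauthor_overlap": 0.8}),
    ("a1", "p2", {"name_match": 0.2, "coauthor_overlap": 0.0}),
    ("a1", "p3", {"name_match": 1.0, "coauthor_overlap": 0.1}),
    ("a2", "p2", {"name_match": 0.9, "coauthor_overlap": 0.5}),
]

ranking = rank_papers_per_author(candidates, [model_a, model_b])
print(ranking)
```

Papers that sink to the bottom of an author's list are the candidates for being incorrectly assigned. Averaging several scorers is one simple way to reduce the variance of any single model.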