Feature engineering and tree modeling for author-paper identification challenge

Authors:
Jiefei Li;Xiaocong Liang;Weijie Ding;Weidong Yang;Rong Pan
Affiliations:
Sun Yat-Sen University;Sun Yat-Sen University;Sun Yat-Sen University;Sun Yat-Sen University;Sun Yat-Sen University
Venue:
Proceedings of the 2013 KDD Cup 2013 Workshop
Year:
2013

Citing 6
Cited 0

Induction of fuzzy decision trees. Fuzzy Sets and Systems.
An information-theoretic perspective of tf-idf measures. Information Processing and Management: an International Journal.
AdaRank: a boosting algorithm for information retrieval. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
Scikit-learn: Machine Learning in Python. The Journal of Machine Learning Research.
Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM Journal on Optimization.
The Microsoft academic search dataset and KDD Cup 2013. Proceedings of the 2013 KDD Cup 2013 Workshop.

Abstract

The ability to search the literature and to collect and aggregate metrics on publications is central to modern research. Academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom. Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community in addition to literature search. Because its data come from many sources, the profile of an author with an ambiguous name tends to contain noise, with some papers assigned to the wrong author. KDD Cup 2013 Track 1 challenges participants to determine which papers in an author profile were truly written by the given author. In this work, we show how to use tree-based models to accurately predict whether a paper belongs to a given author. We combine feature engineering with these models to take advantage of their strengths. This paper introduces two kinds of tree-based models (GBDT [4], RGF [5]) and presents in detail the learning algorithms and how features can be generated for the task. The experimental results show the effectiveness of the proposed approach.
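
As a rough illustration of the tree-based approach described in the abstract, the following Python sketch trains a small gradient-boosted ensemble of decision stumps to score (author, paper) pairs and then ranks the candidate papers. It is not the authors' implementation. The three features (shared coauthor count, same-venue paper count, keyword overlap), the synthetic data, and all function names and hyperparameters are assumptions made for illustration, and RGF is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_pairs(n, seed=0):
    """Synthetic (author, paper) pairs; label 1 means the author wrote the paper.

    Hypothetical features per pair:
    [shared coauthor count, same-venue paper count, keyword overlap].
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        if label:
            row = [rng.randint(0, 6), rng.randint(0, 8), round(rng.uniform(0.2, 1.0), 1)]
        else:
            row = [rng.randint(0, 2), rng.randint(0, 3), round(rng.uniform(0.0, 0.5), 1)]
        X.append(row)
        y.append(label)
    return X, y


def leaf_value(side, residuals, hessians):
    """Newton-style leaf value for the logistic loss, clipped as a numerical safeguard."""
    value = sum(residuals[i] for i in side) / (sum(hessians[i] for i in side) + 1e-9)
    return max(-4.0, min(4.0, value))


def fit_stump(X, residuals, hessians):
    """Pick the single (feature, threshold) split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        # Dropping the largest value keeps the right-hand side non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            sse = 0.0
            for side in (left, right):
                mean = sum(residuals[i] for i in side) / len(side)
                sse += sum((residuals[i] - mean) ** 2 for i in side)
            if best is None or sse < best[0]:
                best = (sse, j, t, left, right)
    _, j, t, left, right = best
    return j, t, leaf_value(left, residuals, hessians), leaf_value(right, residuals, hessians)


def fit_gbdt(X, y, rounds=30, lr=0.3):
    """Gradient boosting on the logistic loss with depth-1 trees (stumps)."""
    p0 = sum(y) / len(y)
    base = math.log(p0 / (1.0 - p0))  # initial log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        probs = [sigmoid(s) for s in scores]
        residuals = [yi - pi for yi, pi in zip(y, probs)]  # negative gradient
        hessians = [pi * (1.0 - pi) for pi in probs]
        j, t, left_val, right_val = fit_stump(X, residuals, hessians)
        stumps.append((j, t, left_val, right_val))
        for i, row in enumerate(X):
            scores[i] += lr * (left_val if row[j] <= t else right_val)
    return base, lr, stumps


def predict_proba(model, row):
    base, lr, stumps = model
    score = base + sum(lr * (lv if row[j] <= t else rv) for j, t, lv, rv in stumps)
    return sigmoid(score)


X, y = make_pairs(200, seed=0)
train_X, train_y = X[:150], y[:150]
test_X, test_y = X[150:], y[150:]

model = fit_gbdt(train_X, train_y)
preds = [predict_proba(model, row) for row in test_X]
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(preds, test_y)) / len(test_y)

# Rank held-out candidate papers so the most likely authored ones come first.
ranking = sorted(range(len(test_X)), key=lambda i: preds[i], reverse=True)
```

In practice a library implementation, such as the gradient boosting classifier in scikit-learn, would replace the hand-written booster. The stumps are used here only to keep the sketch dependency-free and short.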