Selecting optimal training data for learning to rank

Authors:
Xiubo Geng;Tao Qin;Tie-Yan Liu;Xue-Qi Cheng;Hang Li
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, No. 6, Kexueyuan South Road, Zhongguancun, Haidian District, Beijing 100190, PR China;Microsoft Research Asia, No. 49, Zhichun Road, Haidian District, Beijing 100190, PR China;Microsoft Research Asia, No. 49, Zhichun Road, Haidian District, Beijing 100190, PR China;Institute of Computing Technology, Chinese Academy of Sciences, No. 6, Kexueyuan South Road, Zhongguancun, Haidian District, Beijing 100190, PR China;Microsoft Research Asia, No. 49, Zhichun Road, Haidian District, Beijing 100190, PR China
Venue:
Information Processing and Management: an International Journal
Year:
2011

Abstract

This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, the same problem has received little study for ranking. As we point out in this paper, techniques designed for classification are not appropriate to extend to ranking, so new techniques are needed, and this paper develops them. First, we propose the concept of "pairwise preference consistency" (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into account both the ordinal relationship between documents and the hierarchical structure of queries and documents, two properties that are unique to ranking. We then select a subset of the original training documents by maximizing the PPC of the selected subset, and we propose an efficient solution to this maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that training on the subset selected by our approach significantly improves the performance of the learned ranking model.
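The idea of scoring a training set by how consistent its pairwise preferences are, and then keeping the subset that maximizes that score, can be illustrated with a small sketch. The abstract does not give the PPC formula or the efficient solver, so everything below is an assumption made for illustration: `toy_consistency` is a stand-in score (mean cosine similarity between preference difference vectors), `greedy_select` is a naive drop-one heuristic rather than the paper's optimization method, and the data is invented.

```python
from itertools import combinations


def preference_vectors(docs):
    """Build difference vectors x_u - x_v for same-query pairs with label_u > label_v.

    docs: list of (query_id, features, label) tuples (hypothetical layout).
    """
    prefs = []
    for (qa, xa, ya), (qb, xb, yb) in combinations(docs, 2):
        if qa != qb or ya == yb:
            continue  # preferences only exist within a query and between different labels
        hi, lo = (xa, xb) if ya > yb else (xb, xa)
        prefs.append([h - l for h, l in zip(hi, lo)])
    return prefs


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def toy_consistency(docs):
    """Assumed proxy score: mean cosine similarity over all pairs of preference vectors."""
    pairs = list(combinations(preference_vectors(docs), 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def greedy_select(docs, n_drop):
    """Naive heuristic: repeatedly drop the document whose removal yields the highest score."""
    selected = list(docs)
    for _ in range(n_drop):
        best = max(
            range(len(selected)),
            key=lambda i: toy_consistency(selected[:i] + selected[i + 1:]),
        )
        selected.pop(best)
    return selected


# Invented data: the first feature tracks relevance, except for one noisy label.
noisy = ("q2", [1.0, 0.0], 0)  # looks relevant but is labeled irrelevant
docs = [
    ("q1", [1.0, 0.0], 1),
    ("q1", [0.0, 0.0], 0),
    ("q2", [0.9, 0.0], 1),
    ("q2", [0.1, 0.0], 0),
    noisy,
]

selected = greedy_select(docs, n_drop=1)
print(toy_consistency(docs), toy_consistency(selected))
```

On this invented data the greedy step drops the document with the noisy label, because the preference pair it forms points against the preferences found in both queries. The drop-one search recomputes the score for every candidate, so it only suits toy inputs.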