Efficient feature weighting methods for ranking

  • Authors:
  • Hwanjo Yu;Jinoh Oh;Wook-Shin Han

  • Affiliations:
  • POSTECH, Pohang, South Korea;POSTECH, Pohang, South Korea;Kyungbuk National University, Daegu, South Korea

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Abstract

Feature weighting or selection is a crucial process for identifying an important subset of features in a data set. Removing irrelevant or redundant features can improve the generalization performance of ranking functions in information retrieval. Due to fundamental differences between classification and ranking, feature weighting methods developed for classification cannot be readily applied to ranking. A state-of-the-art feature selection method for ranking, called GAS, has recently been proposed; it exploits the importance of each feature and the similarity between every pair of features. However, GAS must compute the similarity scores of all pairs of features, so it does not scale to high-dimensional data, and its performance degrades on nonlinear ranking functions. This paper proposes two novel algorithms, RankWrapper and RankFilter, which are scalable for high-dimensional data and also perform reasonably well on nonlinear ranking functions. RankWrapper and RankFilter are based on the key idea of the Relief algorithm. Relief is a feature selection algorithm for classification that exploits the notions of hits (nearby data points within the same class) and misses (nearby data points from different classes). However, there is no such notion of hits or misses in ranking. The proposed algorithms instead use the ranking distances of nearest data points to identify the key features for ranking. Our extensive experiments show that RankWrapper and RankFilter achieve higher accuracy overall than GAS and traditional Relief algorithms adapted for ranking, and run substantially faster than GAS on high-dimensional data.
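
The hit/miss idea behind Relief can be illustrated with a minimal sketch of the classic two-class Relief for classification. This is not RankWrapper or RankFilter, whose details the abstract does not give. The function name `relief`, the range-scaled L1 distance, and the toy data are assumptions chosen for illustration only.

```python
import random


def relief(X, y, n_iter=None, seed=0):
    """Classic two-class Relief (illustrative sketch); returns one weight per feature.

    Assumes both classes are present and each class has at least two points.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Per-feature value ranges, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    m = n_iter or n
    w = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        # Nearest hit: closest other point with the same class label.
        hit = min((k for k in range(n) if k != i and y[k] == y[i]),
                  key=lambda k: dist(X[i], X[k]))
        # Nearest miss: closest point with a different class label.
        miss = min((k for k in range(n) if y[k] != y[i]),
                   key=lambda k: dist(X[i], X[k]))
        # Reward features that differ on the miss, penalize those that differ on the hit.
        for j in range(d):
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / m
    return w


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2], [1.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y)
print(weights)
```

On this toy data the weight of feature 0 comes out larger than that of feature 1. The sketch relies on class labels to define hits and misses, which is exactly what ranking data lacks; the abstract states that the proposed algorithms use ranking distances of nearest data points instead.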