Document vector representations for feature extraction in multi-stage document ranking

  • Authors: Nima Asadi; Jimmy Lin

  • Affiliations: Department of Computer Science, University of Maryland, College Park, USA; The iSchool, College of Information Studies, University of Maryland, College Park, USA

  • Venue: Information Retrieval
  • Year: 2013

Abstract

We consider a multi-stage retrieval architecture consisting of a fast, "cheap" candidate generation stage, a feature extraction stage, and a more "expensive" reranking stage using machine-learned models. In this context, feature extraction can be accomplished using a document vector index, a mapping from document ids to document representations. We consider alternative organizations of such a data structure for efficient feature extraction: design choices include how document terms are organized, how complex term proximity features are computed, and how these structures are compressed. In particular, we propose a novel document-adaptive hashing scheme for compactly encoding term ids. The impact of alternative designs on both feature extraction speed and memory footprint is experimentally evaluated. Overall, results show that our architecture is comparable in speed to one using a traditional positional inverted index but requires less memory, and offers additional advantages in terms of flexibility.
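
To make the idea of a document vector index concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a toy in-memory index that maps each document id to its term ids in document order, and it extracts two illustrative features for the candidates produced by an earlier stage: per-term frequencies and the smallest positional gap between two distinct query terms. The names `term_frequencies`, `min_pair_distance`, and `extract_features` are hypothetical, and compression and the document-adaptive hashing scheme mentioned in the abstract are deliberately left out.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy document vector index: doc id -> term ids in document order.
DocVectorIndex = Dict[int, List[int]]


def term_frequencies(doc_terms: List[int], query_terms: List[int]) -> Dict[int, int]:
    """Count how often each query term id occurs in the document."""
    counts = Counter(doc_terms)
    return {t: counts[t] for t in query_terms}


def min_pair_distance(doc_terms: List[int], a: int, b: int) -> Optional[int]:
    """Smallest positional gap between an occurrence of a and one of b.

    Assumes a != b. Returns None if either term is absent.
    """
    last_a: Optional[int] = None
    last_b: Optional[int] = None
    best: Optional[int] = None
    for pos, term in enumerate(doc_terms):
        if term == a:
            last_a = pos
            other = last_b
        elif term == b:
            last_b = pos
            other = last_a
        else:
            continue
        if other is not None:
            gap = pos - other
            best = gap if best is None else min(best, gap)
    return best


def extract_features(index: DocVectorIndex, candidates: List[int],
                     query_terms: List[int]) -> Dict[int, dict]:
    """Look up each candidate's document vector and compute its features."""
    features = {}
    for doc_id in candidates:
        doc_terms = index[doc_id]
        features[doc_id] = {
            "tf": term_frequencies(doc_terms, query_terms),
            "min_dist": min_pair_distance(doc_terms, query_terms[0], query_terms[1]),
        }
    return features


# Usage with made-up term ids: two documents, a two-term query.
index = {1: [5, 7, 9, 5, 3, 7], 2: [3, 3, 9]}
result = extract_features(index, candidates=[1, 2], query_terms=[5, 7])
```

Because each lookup touches only the vectors of the candidate documents, the cost of this stage grows with the number of candidates rather than with the length of the query terms' postings lists, which is one way to read the flexibility argument in the abstract.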