Towards optimum query segmentation: in doubt without

  • Authors:
  • Matthias Hagen;Martin Potthast;Anna Beyer;Benno Stein

  • Affiliations:
  • Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Query segmentation is the problem of identifying those keywords in a query, which together form compound concepts or phrases like "new york times". Such segments can help a search engine to better interpret a user's intents and to tailor the search results more appropriately. Our contributions to this problem are threefold. (1) We conduct the first large-scale study of human segmentation behavior based on more than 500000 segmentations. (2) We show that the traditionally applied segmentation accuracy measures are not appropriate for such large-scale corpora and introduce new, more robust measures. (3) We develop a new query segmentation approach with the basic idea that, in cases of doubt, it is often better to (partially) leave queries without any segmentation. This new in-doubt-without approach chooses different segmentation strategies depending on query types. A large-scale evaluation shows substantial improvement upon the state of the art in terms of segmentation accuracy. To draw a complete picture, we also evaluate the impact of segmentation strategies on retrieval performance in a TREC setting. It turns out that more accurate segmentation not necessarily yields better retrieval performance. Based on this insight, we propose an in-doubt-without variant which achieves the best retrieval performance despite leaving many queries unsegmented. But there is still room for improvement: the optimum segmentation strategy which always chooses the segmentation that maximizes retrieval performance, significantly outperforms all other tested approaches.