Query segmentation revisited

  • Authors: Matthias Hagen, Martin Potthast, Benno Stein, Christof Bräutigam
  • Affiliation (all authors): Bauhaus-Universität, Weimar, Germany
  • Venue: Proceedings of the 20th International Conference on World Wide Web
  • Year: 2011

Abstract

We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases where possible. Previous approaches achieve reasonable segmentation performance but have been tested only against a small corpus of manually segmented queries. In addition, many of them are fairly intricate: they rely on expensive features and are difficult to reimplement. The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and achieves segmentation accuracy comparable to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. We also introduce a new evaluation corpus for query segmentation. With about 50,000 human-annotated queries, it is two orders of magnitude larger than the corpus used so far.
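
To make the task concrete, the following Python sketch enumerates every segmentation of a short query and scores each one from n-gram counts held in a dictionary (a hash table) plus a set of Wikipedia titles. The abstract does not give the scoring function, so the weighting used here (length raised to its own power times the count, with a fixed boost for Wikipedia titles) is only an illustrative assumption, and the counts, the `WIKI_BOOST` constant, and all function names are hypothetical.

```python
from itertools import combinations

# Toy n-gram counts (made-up numbers, not real web frequencies).
NGRAM_COUNTS = {
    ("new", "york"): 1000,
    ("york", "times"): 300,
    ("times", "square"): 400,
    ("new", "york", "times"): 500,
}
# Toy set of Wikipedia titles (hypothetical sample).
WIKI_TITLES = {("new", "york", "times")}
# Assumed boost for segments that are Wikipedia titles (not from the abstract).
WIKI_BOOST = 2


def segmentations(words):
    """Yield all 2^(n-1) ways to split `words` into contiguous segments."""
    n = len(words)
    for k in range(n):
        for breaks in combinations(range(1, n), k):
            bounds = (0, *breaks, n)
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]


def segment_score(seg):
    """Score one segment; single words contribute nothing (assumed weighting)."""
    if len(seg) < 2:
        return 0
    count = NGRAM_COUNTS.get(seg, 0)
    if seg in WIKI_TITLES:
        count *= WIKI_BOOST
    return len(seg) ** len(seg) * count


def segment(query):
    """Return the highest-scoring segmentation of a whitespace-separated query."""
    words = query.lower().split()
    return max(segmentations(words),
               key=lambda s: sum(segment_score(seg) for seg in s))


if __name__ == "__main__":
    print(segment("new york times square"))
```

Exhaustive enumeration is exponential in the number of keywords, which is tolerable only because keyword queries are typically short; each score lookup is a constant-time hash-table access.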