Non-contiguous word sequences for information retrieval

  • Authors: Antoine Doucet; Helena Ahonen-Myka
  • Affiliations: University of Helsinki, Finland; University of Helsinki, Finland
  • Venue: MWE '04 Proceedings of the Workshop on Multiword Expressions: Integrating Processing
  • Year: 2004

Abstract

The growing amount of textual information available electronically has increased the need for high-performance retrieval. Phrases were long seen as a natural way to improve retrieval performance over the common document models that ignore the sequential aspect of word occurrences in documents and treat them as "bags of words". However, both statistical and syntactical phrases have shown disappointing results on large document collections. In this paper we present a recent type of multi-word expression, Maximal Frequent Sequences (Ahonen-Myka, 1999). These are mined phrases rather than statistical or syntactical ones. Their main strengths are that they form a very compact index and that they account for the sequentiality and adjacency of meaningful word co-occurrences while allowing for a gap between words. We introduce a method for using these phrases in information retrieval and present our experiments, which show a clear improvement over the well-known technique of extracting frequent word pairs.
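
To make the idea of gapped word sequences concrete, here is a minimal Python sketch. It is not the Maximal Frequent Sequence mining algorithm of Ahonen-Myka (1999), which the abstract does not describe. It only illustrates two ingredients the abstract mentions: counting frequent ordered word pairs (the baseline technique) and testing whether a longer word sequence occurs in a document when a gap is allowed between consecutive words. The function names, the `max_gap` and `min_docs` parameters and their values, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def gapped_pairs(tokens, max_gap=2):
    """Ordered pairs (a, b) where b follows a with at most max_gap words between."""
    pairs = set()
    for i, a in enumerate(tokens):
        # Look ahead over the next max_gap + 1 positions.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.add((a, tokens[j]))
    return pairs


def frequent_pairs(docs, min_docs=2, max_gap=2):
    """Pairs that occur in at least min_docs documents (document frequency)."""
    counts = Counter()
    for doc in docs:
        # A set per document, so each pair counts once per document.
        counts.update(gapped_pairs(doc.lower().split(), max_gap))
    return {pair: n for pair, n in counts.items() if n >= min_docs}


def occurs_with_gap(seq, tokens, max_gap=2):
    """True if seq appears in order in tokens, with at most max_gap
    intervening words between consecutive words of seq."""
    if not seq:
        return True
    # Positions where a match of the prefix seen so far can end.
    ends = {i for i, t in enumerate(tokens) if t == seq[0]}
    for word in seq[1:]:
        ends = {
            j
            for j, t in enumerate(tokens)
            if t == word and any(0 < j - p <= max_gap + 1 for p in ends)
        }
        if not ends:
            return False
    return bool(ends)


# Hypothetical toy collection, for illustration only.
docs = [
    "the president of the united states spoke",
    "the united states president spoke today",
    "states united in protest",
]
pairs = frequent_pairs(docs, min_docs=2, max_gap=2)
tokens = docs[0].split()
print(("united", "states") in pairs)
print(occurs_with_gap(["the", "united", "states"], tokens))
```

Word order matters in this sketch: the pair ("united", "states") is counted separately from ("states", "united"), which a bag-of-words model cannot distinguish.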