Efficient multi-word expressions extractor using suffix arrays and related structures

  • Authors:
  • José Aires; Gabriel Lopes; Joaquim Ferreira Silva

  • Affiliations:
  • Universidade Nova de Lisboa, Lisboa, Portugal; Universidade Nova de Lisboa, Lisboa, Portugal; Universidade Nova de Lisboa, Lisboa, Portugal

  • Venue:
  • Proceedings of the 2nd ACM workshop on Improving non english web searching
  • Year:
  • 2008

Abstract

For Information Retrieval purposes, potentially huge corpora that change in predictable ways must be processed regularly to extract contiguous Multi-Word Expressions (MWEs), and this must be done in a computationally tractable way. In this paper we mainly explore the use of Suffix Arrays, together with the SCP association measure and the Smoothed LocalMaxs algorithm. The choice of Suffix Arrays, combined with the construction of auxiliary structures, clearly reduced the time needed to extract multi-word expressions, and introducing a limit on the number of words yields linear complexity. Although the methodology is essentially statistical, we show how to handle hybrid extraction mechanisms.
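
As an illustration of the ingredients named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It builds a word-level suffix array whose ordering only considers the first `max_len` words of each suffix, uses binary search to obtain n-gram frequencies, and computes an SCP-style score. The names `build_suffix_array`, `frequency`, and `scp` are hypothetical. Estimating probabilities as frequency divided by corpus size, and the SCP form used (squared n-gram probability divided by the average product of the probabilities of its two-part splits), are assumptions here and should be checked against the paper. The Smoothed LocalMaxs selection step and the paper's auxiliary structures are not shown.

```python
def build_suffix_array(tokens, max_len):
    """Word-level suffix array: start positions sorted by their first max_len words.

    Truncating the sort key reflects the limit on the number of words;
    this naive sort is for illustration only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_len])


def frequency(tokens, sa, ngram):
    """Count occurrences of ngram (at most max_len words) by binary search."""
    target = list(ngram)
    n = len(target)

    def prefix(k):
        # First n words of the k-th suffix in sorted order.
        return tokens[sa[k]:sa[k] + n]

    # Lower bound: first suffix whose n-word prefix is >= target.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose n-word prefix is > target.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def scp(tokens, sa, ngram):
    """SCP-style cohesion score for an n-gram with n >= 2 (assumed formula)."""
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs at least two words")
    total = len(tokens)

    def p(gram):
        # Assumption: probability estimated as relative frequency.
        return frequency(tokens, sa, gram) / total

    # Average over every way of splitting the n-gram into a left and right part.
    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg if avg else 0.0


tokens = "new york is big and new york is old".split()
sa = build_suffix_array(tokens, max_len=3)
pair_count = frequency(tokens, sa, ("new", "york"))
strong = scp(tokens, sa, ("new", "york"))
weak = scp(tokens, sa, ("is", "big"))
```

On this toy corpus, "new york" occurs twice and scores higher than "is big", which occurs once next to a word that also appears elsewhere. A full extractor would then compare each n-gram's score with those of the (n-1)-grams it contains and the (n+1)-grams that contain it, which is the role the abstract assigns to the Smoothed LocalMaxs algorithm.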