Efficient multi-word expressions extractor using suffix arrays and related structures

Authors:
José Aires;Gabriel Lopes;Joaquim Ferreira Silva
Affiliations:
Universidade Nova de Lisboa, Lisboa, Portugal;Universidade Nova de Lisboa, Lisboa, Portugal;Universidade Nova de Lisboa, Lisboa, Portugal
Venue:
Proceedings of the 2nd ACM workshop on Improving non english web searching
Year:
2008

Abstract

For Information Retrieval purposes, predictably dynamic and potentially huge corpora must be processed regularly to extract contiguous Multi-Word Expressions (MWEs), and this processing must remain computationally tractable. In this paper we mainly explore the use of Suffix Arrays together with the SCP association measure and the Smoothed LocalMaxs algorithm. The choice of Suffix Arrays, combined with the construction of auxiliary structures, clearly reduced the time needed to extract multi-word expressions, and introducing a limit on the number of words yields linear complexity. Although the methodology is essentially statistical, we also show how to handle hybrid extraction mechanisms.
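
To make the combination of Suffix Arrays and the SCP measure concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a word-level suffix array built by naive sorting (real systems use faster construction algorithms and auxiliary structures), and it assumes the SCP definition commonly given in the LocalMaxs literature: the squared probability of the n-gram divided by the average product of the probabilities of its two parts over all split points. The function names `build_suffix_array`, `ngram_count`, and `scp` are hypothetical, and the Smoothed LocalMaxs selection step is not shown.

```python
def build_suffix_array(tokens):
    # Naive construction (assumption: fine for illustration only):
    # sort suffix start positions by the token sequence that follows.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    # Binary search over the suffix array, comparing only the first
    # len(ngram) tokens of each suffix against the query n-gram.
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def ngram_count(tokens, sa, ngram):
    # All suffixes starting with the n-gram are adjacent in the suffix
    # array, so the frequency is the width of that block.
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def scp(tokens, sa, ngram):
    # Assumed SCP form: p(w1..wn)^2 divided by the mean of
    # p(w1..wi) * p(wi+1..wn) over the n-1 possible split points.
    ngram = list(ngram)
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs an n-gram of at least two words")
    total = len(tokens)

    def p(gram):
        return ngram_count(tokens, sa, gram) / total

    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg if avg else 0.0


tokens = "new york is big and new york is old".split()
sa = build_suffix_array(tokens)
print(ngram_count(tokens, sa, ["new", "york"]))
print(scp(tokens, sa, ["new", "york"]), scp(tokens, sa, ["is", "big"]))
```

In this toy corpus "new" and "york" always occur together, so the pair scores higher than "is big", whose first word also appears in another context. Capping the n-gram length, as the abstract suggests, bounds the number of candidates examined per corpus position.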