Using masks, suffix array-based data structures and multidimensional arrays to compute positional ngram statistics from corpora

Authors:
Alexandre Gil; Gaël Dias
Affiliations:
New University of Lisbon, Caparica, Portugal; Beira Interior University, Covilhã, Portugal
Venue:
MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18
Year:
2003

Abstract

This paper describes an implementation that computes positional ngram statistics (namely frequency and Mutual Expectation) using masks, suffix array-based data structures, and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has proved successful for extracting discontinuous collocations from large corpora. However, it is computationally expensive. For instance, 4,299,742 positional ngrams (n = 1..7) can be generated from a 100,000-word corpus with a seven-word window context, whereas the classical ngram model would yield only 700,000 ngrams. Considerable effort is therefore needed to compute positional ngram statistics in reasonable time and space. Our solution runs in O(h(F) N log N) time, where N is the corpus size and h(F) is a function of the window context.
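
To make the notion of a positional ngram and the size of the search space concrete, here is a minimal brute-force sketch in Python. It is not the implementation described in the paper: it uses no suffix arrays or multidimensional arrays, and the function names (`position_patterns`, `positional_ngrams`) are hypothetical. It also assumes one particular reading of the "seven-word window context", namely a pivot word with three words on either side, so that a set of word positions counts as a positional ngram when some word in it can act as a pivot whose window covers all the others. Under that assumption each start position admits 43 masks, and 43 × (100,000 − 6) = 4,299,742, which agrees with the figure quoted in the abstract, although the abstract itself does not confirm this reading. The classical model has 7 contiguous patterns per position, giving roughly 7 × 100,000 = 700,000.

```python
from collections import Counter
from itertools import combinations


def position_patterns(half_window=3):
    """Return the masks (offset patterns) that start at offset 0.

    ASSUMPTION: the window is symmetric around a pivot word. A pattern is
    kept when one of its own offsets can serve as a pivot p such that the
    whole pattern fits inside [p - half_window, p + half_window].
    """
    span = 2 * half_window
    patterns = []
    for size in range(span + 1):
        for rest in combinations(range(1, span + 1), size):
            offsets = (0,) + rest
            last = offsets[-1]
            if any(p <= half_window and last - p <= half_window for p in offsets):
                patterns.append(offsets)
    return patterns


def positional_ngrams(tokens, half_window=3):
    """Yield every positional ngram occurrence as a tuple of (offset, word) pairs."""
    patterns = position_patterns(half_window)
    for start in range(len(tokens)):
        for offsets in patterns:
            # Skip masks that would run past the end of the corpus.
            if start + offsets[-1] < len(tokens):
                yield tuple((off, tokens[start + off]) for off in offsets)


if __name__ == "__main__":
    print(len(position_patterns(3)))  # number of masks per start position

    corpus = "the cat sat on the mat".split()
    frequency = Counter(positional_ngrams(corpus))
    print(frequency[((0, "the"),)])             # occurrences of the unigram "the"
    print(frequency[((0, "the"), (2, "sat"))])  # "the _ sat", one gap
```

Counting every occurrence in a dictionary, as above, is only practical for small inputs. It illustrates why the abstract calls for dedicated data structures once the corpus and the window grow.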