This paper describes an implementation for computing positional ngram statistics (i.e., frequency and Mutual Expectation) based on masks, suffix-array-based data structures, and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extraction of discontinuous collocations from large corpora. However, it is computationally expensive. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus with a seven-word context window. In comparison, only 700,000 ngrams would be computed for the classical ngram model. Considerable effort is therefore needed to process positional ngram statistics in reasonable time and space. Our solution has O(h(F) N log N) time complexity, where N is the corpus size and h(F) is a function of the window context.
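To make the idea of mask-based positional ngrams concrete, here is a minimal Python sketch. It is not the paper's implementation: it uses plain dictionary counting rather than suffix arrays, and it assumes a masking scheme in which every ngram is anchored on the first word of the window while a binary mask selects which later positions are kept. The names `positional_ngrams`, `tokens`, and `window` are hypothetical, and the sketch is not expected to reproduce the counts quoted above.

```python
from collections import Counter
from itertools import product


def positional_ngrams(tokens, window=3):
    """Yield positional ngrams as tuples of (offset, word) pairs.

    Assumption (not taken from the paper): each ngram is anchored on the
    first word of the window (offset 0), and a binary mask over the
    remaining window - 1 offsets decides which later words are kept.
    Only windows that fit entirely inside the token list are used.
    """
    for start in range(len(tokens) - window + 1):
        for mask in product((0, 1), repeat=window - 1):
            picked = [(0, tokens[start])]
            picked += [
                (k + 1, tokens[start + k + 1])
                for k, bit in enumerate(mask)
                if bit
            ]
            yield tuple(picked)


# Toy corpus; a real corpus would be far larger.
tokens = "a b c a b d".split()

# Frequency of every positional ngram in a three-word window.
freq = Counter(positional_ngrams(tokens, window=3))

# Discontinuous ngram "a _ c": 'a' at offset 0, 'c' at offset 2.
gap_count = freq[((0, "a"), (2, "c"))]

# Contiguous bigram "a b".
bigram_count = freq[((0, "a"), (1, "b"))]
```

Under this assumed scheme each window start yields 2^(window - 1) masked ngrams, whereas the classical model yields only `window` contiguous ones, which illustrates why the positional model grows so much faster with the window size.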