Statistical substring reduction in linear time

Authors:
Xueqiang Lü; Le Zhang; Junfeng Hu
Affiliations:
Institute of Computational Linguistics, Peking University, Beijing; Institute of Computer Software & Theory, Northeastern University, Shenyang; Institute of Computational Linguistics, Peking University, Beijing
Venue:
IJCNLP'04: Proceedings of the First International Joint Conference on Natural Language Processing
Year:
2004

Abstract

We study the problem of efficiently removing equal-frequency n-gram substrings from an n-gram set, formally called Statistical Substring Reduction (SSR). SSR is a useful operation in corpus-based multi-word unit research and in the new word identification task of oriental language processing. We present a new SSR algorithm that has linear time (O(n)) complexity, and prove its equivalence with the traditional O(n^2) algorithm. In particular, using experimental results from several corpora of different sizes, we show that it is possible to achieve performance close to that theoretically predicted for this task. Even on a small corpus, the new algorithm is several orders of magnitude faster than the O(n^2) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large-scale corpus processing.
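
To make the operation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual reading of SSR: an n-gram is dropped when a longer n-gram in the set contains it and has the same frequency. The function names (ngram_counts, ssr_quadratic, ssr_sorted) are hypothetical. ssr_quadratic is the pairwise baseline. ssr_sorted is an assumed sort-and-compare-neighbours variant, since the abstract does not describe how the linear algorithm works; with a comparison sort it runs in O(n log n) rather than O(n), and it is expected to agree with the baseline only when the input holds the counts of every n-gram up to some maximum length.

```python
from collections import Counter
from typing import Dict


def ngram_counts(text: str, max_n: int) -> Dict[str, int]:
    """Count every character n-gram of length 1..max_n in text."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return dict(counts)


def ssr_quadratic(counts: Dict[str, int]) -> Dict[str, int]:
    """Pairwise baseline: drop x if a longer y contains x with equal frequency."""
    kept = {}
    for x, fx in counts.items():
        redundant = any(
            len(y) > len(x) and fy == fx and x in y
            for y, fy in counts.items()
        )
        if not redundant:
            kept[x] = fx
    return kept


def ssr_sorted(counts: Dict[str, int]) -> Dict[str, int]:
    """Assumed variant: sort, then compare each n-gram with its successor.

    One pass over the strings catches equal-frequency proper prefixes;
    a second pass over the reversed strings catches proper suffixes.
    """
    removed = set()
    for transform in (lambda s: s, lambda s: s[::-1]):
        items = sorted((transform(s), f) for s, f in counts.items())
        for (a, fa), (b, fb) in zip(items, items[1:]):
            if fa == fb and len(b) > len(a) and b.startswith(a):
                removed.add(transform(a))  # map back to the original string
    return {s: f for s, f in counts.items() if s not in removed}


if __name__ == "__main__":
    counts = ngram_counts("abab", 3)
    print(ssr_quadratic(counts))  # {'ab': 2, 'aba': 1, 'bab': 1}
    print(ssr_sorted(counts))     # {'ab': 2, 'aba': 1, 'bab': 1}
```

In the "abab" example, "a" and "b" each occur twice, exactly as often as "ab", so they carry no extra information and are removed; "ba" is removed because it occurs once, like "aba".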