Substring Statistics

Authors:
Kyoji Umemura; Kenneth Church
Affiliations:
Toyohashi University of Technology, Tempaku, Toyohashi, Japan 441-8580; Microsoft, One Microsoft Way, Redmond, USA 98052
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009



Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper shows how to compute many statistics of interest for all substrings in a large corpus. The method computes not only the standard corpus frequency, freq, and document frequency, df, but also generalizes naturally to df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy, and adaptation.
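
To make the quantities named in the abstract concrete, the following is a minimal sketch for a single substring over a toy corpus. It counts occurrences by brute force and is not the paper's method for handling all substrings of a large corpus. The function names (count_occurrences, df_k, substring_stats) are hypothetical, overlapping occurrences are counted as an assumption, and adaptation is taken here to be df_2 / df_1 (the chance of a second mention given a first), which is also an assumption about the intended definition.

```python
from math import log2


def count_occurrences(text, s):
    """Count occurrences of s in text, allowing overlaps (an assumption)."""
    if not s:
        return 0
    count = 0
    pos = text.find(s)
    while pos != -1:
        count += 1
        pos = text.find(s, pos + 1)  # advance by one so overlaps are seen
    return count


def df_k(counts, k):
    """Number of documents that contain the substring at least k times."""
    return sum(1 for c in counts if c >= k)


def substring_stats(docs, s):
    """Brute-force freq, df, and distribution statistics for one substring."""
    counts = [count_occurrences(d, s) for d in docs]
    n = len(counts)
    top = max(counts, default=0)

    # P(count == k) estimated as (df_k - df_{k+1}) / n, with df_0 == n.
    pmf = {k: (df_k(counts, k) - df_k(counts, k + 1)) / n
           for k in range(top + 1)}

    mean = sum(k * p for k, p in pmf.items())
    variance = sum((k - mean) ** 2 * p for k, p in pmf.items())
    entropy = -sum(p * log2(p) for p in pmf.values() if p > 0)

    df1, df2 = df_k(counts, 1), df_k(counts, 2)
    # Assumed definition: adaptation = df_2 / df_1.
    adaptation = df2 / df1 if df1 else None

    return {
        "freq": sum(counts),  # total occurrences in the corpus
        "df": df1,            # documents with at least one occurrence
        "counts": counts,
        "pmf": pmf,
        "mean": mean,
        "variance": variance,
        "entropy": entropy,
        "adaptation": adaptation,
    }


if __name__ == "__main__":
    docs = ["banana bandana", "ban", "apple", "anan"]
    stats = substring_stats(docs, "an")
    for name in ("freq", "df", "mean", "variance", "entropy", "adaptation"):
        print(name, stats[name])
```

One property worth noting in this sketch: summing df_k over all k >= 1 gives back freq, so the mean count per document is the sum of the df_k values divided by the number of documents.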