The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams) in a large corpus: anything we can do with words, we ought to be able to do with substrings. The method computes the standard corpus frequency, freq, and document frequency, df, and generalizes naturally to df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of that distribution, e.g., mean, variance (and other moments), entropy, and adaptation.
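To make the definitions concrete, the following is a minimal Python sketch of freq, df_k, and the count distribution derived from df_k for a single substring. It is a brute-force illustration on a toy corpus, not the method of the paper, which targets all substrings of a large corpus at once. The function names (`count_occurrences`, `substring_stats`, `count_distribution`) and the choice to count overlapping occurrences are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def count_occurrences(doc: str, s: str) -> int:
    """Count occurrences of s in doc, allowing overlaps (an assumption here)."""
    if not s:
        return 0
    n = 0
    start = doc.find(s)
    while start != -1:
        n += 1
        # Advance by one character so overlapping matches are also found.
        start = doc.find(s, start + 1)
    return n


def substring_stats(docs: List[str], s: str, max_k: int = 3) -> Tuple[int, Dict[int, int]]:
    """Return (freq, {k: df_k}) for substring s, with k = 1..max_k."""
    counts = [count_occurrences(d, s) for d in docs]
    freq = sum(counts)  # total occurrences across the corpus
    # df_k = number of documents containing s at least k times
    df = {k: sum(1 for c in counts if c >= k) for k in range(1, max_k + 1)}
    return freq, df


def count_distribution(docs: List[str], s: str, max_k: int = 3) -> Dict[int, float]:
    """Estimate P(a document contains s exactly k times) for k = 0..max_k-1.

    Uses P(count == k) = (df_k - df_{k+1}) / D, where df_0 = D is the
    number of documents.
    """
    _, df = substring_stats(docs, s, max_k)
    d = len(docs)
    df[0] = d
    return {k: (df[k] - df[k + 1]) / d for k in range(max_k)}


# Hypothetical toy corpus: one string per document.
docs = ["to be or not to be", "to be", "not at all"]
freq, df = substring_stats(docs, "to be")
dist = count_distribution(docs, "to be")
```

On this toy corpus, "to be" occurs twice in the first document, once in the second, and not at all in the third, so freq is 3, df_1 is 2, and df_2 is 1. The mean count per document is freq divided by the number of documents; higher moments and entropy can be computed from the estimated distribution in the same way.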