Substring Statistics

  • Authors:
  • Kyoji Umemura; Kenneth Church

  • Affiliations:
  • Toyohashi University of Technology, Tempaku, Toyohashi, Japan 441-8580; Microsoft, One Microsoft Way, Redmond, USA 98052

  • Venue:
  • CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2009

Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper shows how to compute many statistics of interest for all substrings (ngrams) in a large corpus. The method not only computes the standard corpus frequency, freq, and document frequency, df, but also generalizes naturally to compute df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation.
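
To make the definitions of freq, df and df_k concrete, here is a minimal brute-force sketch in Python. It is not the method of the paper, which targets large corpora; it simply enumerates every substring up to a small length and is only suitable for toy inputs. The function names (`substring_stats`, `freq`, `df_k`, `prob_exactly_k`) and the `max_len` cap are assumptions introduced for illustration. Occurrences are counted with overlaps allowed.

```python
from collections import Counter, defaultdict


def substring_stats(docs, max_len=3):
    """Brute-force illustration (hypothetical helper, not the paper's algorithm).

    Returns a mapping: substring -> list of within-document counts,
    with one entry per document that contains the substring.
    """
    per_doc = defaultdict(list)
    for doc in docs:
        # Count every substring of length 1..max_len in this document.
        counts = Counter(
            doc[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(doc) - n + 1)
        )
        for s, c in counts.items():
            per_doc[s].append(c)
    return per_doc


def freq(per_doc, s):
    """Corpus frequency: total occurrences of s over all documents."""
    return sum(per_doc.get(s, []))


def df_k(per_doc, s, k=1):
    """Number of documents that mention s at least k times (k >= 1)."""
    return sum(1 for c in per_doc.get(s, []) if c >= k)


def prob_exactly_k(per_doc, s, k, n_docs):
    """Estimated probability that a document mentions s exactly k times.

    Uses the identity: #docs with exactly k = df_k - df_{k+1},
    with df_0 taken to be the total number of documents.
    """
    at_least_k = n_docs if k == 0 else df_k(per_doc, s, k)
    return (at_least_k - df_k(per_doc, s, k + 1)) / n_docs


if __name__ == "__main__":
    docs = ["abab", "ab", "ba"]
    stats = substring_stats(docs, max_len=2)
    print(freq(stats, "ab"), df_k(stats, "ab", 1), df_k(stats, "ab", 2))
```

The differences df_k - df_{k+1} recover the full distribution of per-document counts, which is why summary statistics such as the mean and variance follow once df_k is known for every k.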