Mathematical and Computer Modelling: An International Journal
This paper establishes the general relation between the distribution of N-tuples of letters (e.g., N-truncations, N-grams) or words (e.g., N-word phrases) and the distributions of the single letters or words. The very general case is treated, in which the distribution may depend on the place i in the N-tuple (i = 1,..., N): for each i, a different distribution of the letters or words is assumed. Concrete calculations are performed in the important case of Zipfian distributions (i.e., power laws) for the single letters or words. In this case, we prove that the distribution of the N-tuples (for fixed N) is a sum of power laws.
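The setting described above can be illustrated numerically. The following is only a minimal sketch, not the paper's derivation: it assumes that the places in the N-tuple are statistically independent (an assumption of this sketch, not something stated in the abstract), so that the probability of an N-tuple is the product of the place-specific probabilities of its symbols. The function names `zipf` and `tuple_distribution`, the alphabet size of 26, and the exponents are all hypothetical choices made for illustration.

```python
from itertools import product


def zipf(n_symbols, exponent):
    """Normalized power-law probabilities for ranks 1..n_symbols."""
    weights = [r ** -exponent for r in range(1, n_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def tuple_distribution(position_dists):
    """Probabilities of all N-tuples in decreasing (rank-frequency) order.

    position_dists[i] is the symbol distribution at place i + 1.
    ASSUMPTION of this sketch: places are independent, so a tuple's
    probability is the product of its per-place probabilities.
    """
    probs = []
    for combo in product(*position_dists):
        p = 1.0
        for q in combo:
            p *= q
        probs.append(p)
    return sorted(probs, reverse=True)


# Hypothetical setup: N = 3 places, 26 symbols, a different exponent per place.
dists = [zipf(26, a) for a in (1.0, 1.2, 0.8)]
ranked = tuple_distribution(dists)
```

Plotting `ranked` against rank on log-log axes gives an empirical rank-frequency curve for the 3-tuples, which can be compared with the curves of the single-symbol distributions in `dists`.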