N-grams are generalized words consisting of N consecutive symbols (letters) as they occur in a text; N-word phrases are the analogous concepts consisting of N consecutive words, also as used in a text. Given that the rank-frequency function of single letters (1-grams) or of single words (one-word phrases) is Zipfian, we determine in this paper the exact rank-frequency function (the occurrence of N-grams or N-word phrases at each rank) and the exact size-frequency function (the density of N-grams or N-word phrases at each occurrence level). This paper distinguishes itself from other work on the topic by allowing no approximations in the calculations. This leads to an intricate rank-frequency function for N-grams and N-word phrases (as was known before from unpublished calculations) but, surprisingly, to a very simple size-frequency function f_N for N-grams or N-word phrases of the form

f_N(j) = (F / j^(1 + 1/β)) · ln^(N−1)(Gj),

where the Zipfian distribution of single letters or words is proportional to 1/r^β. The paper closes with the calculation of the type/token averages μ_N and the type/token-taken averages μ*_N for N-grams and N-word phrases, where we verify the theoretically proved result μ*_N ≥ μ_N and also give estimates for the differences μ*_N − μ_N.
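As a rough illustration of the objects the abstract discusses — not the paper's derivation — the following sketch counts letter N-grams in a sample text to obtain an empirical rank-frequency list, and evaluates the stated size-frequency form with placeholder constants F, G, and β (illustrative values, not fitted to any corpus):

```python
from collections import Counter
from math import log

def ngram_rank_frequency(text, n):
    """Count letter n-grams (n consecutive symbols) in a text and
    return their occurrence counts sorted by rank (most frequent first)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [count for _, count in grams.most_common()]

def size_frequency(j, n, F=1.0, G=10.0, beta=1.0):
    """Evaluate the size-frequency form f_N(j) = (F / j^(1+1/beta)) * ln^(N-1)(G*j).
    F, G, and beta are placeholder constants for illustration only."""
    return (F / j ** (1 + 1 / beta)) * log(G * j) ** (n - 1)

text = "to be or not to be that is the question".replace(" ", "")
ranks = ngram_rank_frequency(text, 2)
print(ranks[:5])  # bigram counts in decreasing order of rank
print(size_frequency(2, 1))  # with the defaults: (1/2^2) * ln(20)^0 = 0.25
```

Note that for N = 1 the logarithmic factor drops out (ln^0 = 1) and the form reduces to the classical Lotka-type size-frequency function F / j^(1+1/β), consistent with the single-letter/single-word case.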