Extending Zipf's law to n-grams for large corpora

Authors:
Le Quan Ha;Philip Hanna;Ji Ming;F. J. Smith
Affiliations:
Computer Science Branch, Hochiminh City University of Industry, Ministry of Industry and Trade, Hochiminh City, Vietnam;School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, Northern Ireland BT7 1NN;School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, Northern Ireland BT7 1NN;School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, Northern Ireland BT7 1NN
Venue:
Artificial Intelligence Review
Year:
2009

Abstract

Experiments show that for a large corpus, Zipf's law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf's law for ranks greater than about 5,000 word types in the English language and about 30,000 word types in the inflected languages Irish and Latin. It also does not hold for syllables or words in the syllable-based languages, Chinese or Vietnamese. However, when single words are combined together with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf's law with a slope close to −1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf's law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
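
The combined-list procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`combined_ngram_counts`, `rank_frequency`, `loglog_slope`), the whitespace tokenisation, and the plain least-squares fit over all ranks are assumptions made here for illustration only. The sketch counts word n-grams for n = 1 up to a chosen maximum in one list, sorts the counts into rank order, and estimates the slope of log frequency against log rank.

```python
from collections import Counter
import math


def combined_ngram_counts(tokens, max_n):
    """Count every word n-gram for n = 1..max_n in one shared Counter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Frequencies in descending order; the item at index r-1 has rank r."""
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes at least two ranks and strictly positive frequencies.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a real experiment would need a large corpus.
    tokens = "the cat sat on the mat and the cat ran".split()
    freqs = rank_frequency(combined_ngram_counts(tokens, max_n=3))
    print(freqs[:5], loglog_slope(freqs))
```

On a toy input like the one above the fitted slope is not meaningful; the abstract's claim of a slope close to −1 concerns large corpora. A fit restricted to a chosen rank range, rather than all ranks, may be closer to how such plots are usually read.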