Extension of Zipf's law to words and phrases

Authors:
Le Quan Ha;E. I. Sicilia-Garcia;Ji Ming;F. J. Smith
Affiliations:
Queen's University of Belfast, Belfast, Northern Ireland;Queen's University of Belfast, Belfast, Northern Ireland;Queen's University of Belfast, Belfast, Northern Ireland;Queen's University of Belfast, Belfast, Northern Ireland
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Abstract

Zipf's law states that the frequency of a word in a large corpus of natural language is inversely proportional to its rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words. However, when single words and n-gram phrases are combined in one list and ordered by frequency, the combined list follows Zipf's law accurately for all words and phrases, down to the lowest frequencies, in both languages. The Zipf curves for the two languages are then almost identical.
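
The combined ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the upper n-gram length `max_n`, and the function names `ngrams`, `combined_rank_frequency`, and `zipf_slope` are all assumptions made for illustration. The sketch counts single words and n-grams in one list, ranks them by frequency, and estimates the slope of log frequency against log rank, which Zipf's law (frequency proportional to 1/rank) predicts to be close to -1.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-word phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_rank_frequency(tokens, max_n=3):
    """Count words and n-grams (n = 1..max_n) in ONE list and rank them.

    max_n is an illustrative assumption; the abstract does not state
    which n-gram lengths were used.
    Returns a list of (rank, frequency) pairs, rank 1 = most frequent.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law predicts.
    """
    if len(rank_freq) < 2:
        raise ValueError("need at least two ranked items to fit a slope")
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a meaningful fit needs a large corpus.
    text = "the cat sat on the mat and the dog sat on the rug"
    ranked = combined_rank_frequency(text.split(), max_n=2)
    print(ranked[:5])
    print(zipf_slope(ranked))
```

On a real corpus the same ranking would be plotted on log-log axes; a toy sentence like the one above is far too small to show the behaviour the abstract reports.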