Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

Authors:
Sharon Goldwater;Thomas L. Griffiths;Mark Johnson
Venue:
The Journal of Machine Learning Research
Year:
2011

Abstract

Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes (the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process) that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.
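
The two-stage idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the standard seating rule of the two-parameter (Pitman-Yor) Chinese restaurant process: with i customers already seated at K tables, the next customer joins table k with probability (n_k - a) / (i + b) and opens a new table with probability (K*a + b) / (i + b). The names `sample_two_stage` and `uniform_generator`, the parameter names `a` and `b`, and the uniform lexicon of 1000 word types are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def uniform_generator(rng, n_types=1000):
    """Stage one (hypothetical generator): a uniform draw over n_types word types."""
    return f"w{rng.randrange(n_types)}"


def sample_two_stage(generator, n_tokens, a=0.5, b=1.0, seed=0):
    """Sample tokens from a generator wrapped in a Pitman-Yor CRP adaptor.

    a is the discount (0 <= a < 1) and b the concentration (b > -a).
    Returns (tokens, table_labels, table_counts).
    """
    if not 0 <= a < 1 or b <= -a:
        raise ValueError("need 0 <= a < 1 and b > -a")
    rng = random.Random(seed)
    table_counts = []  # customers (tokens) seated at each table
    table_labels = []  # word that the generator assigned to each table
    tokens = []
    for i in range(n_tokens):  # i customers are already seated
        k = len(table_counts)
        # The first customer always opens a table; afterwards a new table
        # is opened with probability (k*a + b) / (i + b).
        if k == 0 or rng.random() < (k * a + b) / (i + b):
            table_counts.append(1)
            table_labels.append(generator(rng))  # stage one: generate a word
            tokens.append(table_labels[-1])
        else:
            # Stage two: reuse an existing table with weight n_k - a.
            weights = [n - a for n in table_counts]
            idx = rng.choices(range(k), weights=weights)[0]
            table_counts[idx] += 1
            tokens.append(table_labels[idx])
    return tokens, table_labels, table_counts


if __name__ == "__main__":
    tokens, labels, counts = sample_two_stage(uniform_generator, 20000)
    token_freq = Counter(tokens)   # frequencies after the adaptor
    table_freq = Counter(labels)   # how often the generator emitted each word
    print("distinct words:", len(token_freq), "tables:", len(counts))
    print("most frequent words:", token_freq.most_common(5))
```

Although the generator in this sketch is uniform, the adaptor reuses popular tables, so the token frequencies become heavily skewed. Setting `a=0` reduces the rule to the one-parameter Chinese restaurant process. The generator is consulted only once per table, so under this sketch it sees something closer to type counts than to token counts.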