A Bayesian mixture model for term re-occurrence and burstiness

  • Authors: Avik Sarkar; Paul H. Garthwaite; Anne De Roeck

  • Affiliations: The Open University, Milton Keynes, UK (all three authors)

  • Venue: CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning
  • Year: 2005

Abstract

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is carried out in a Bayesian framework, which allows a flexible model to be fitted. The model provides measures of a term's re-occurrence rate and within-document burstiness. It applies to all kinds of terms, whether rare content words, medium-frequency terms, or frequent function words. A measure of a term's importance is also proposed, based on its distribution pattern in the corpus.
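
To make the idea of modeling gaps concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mixture of exactly two exponential components with hypothetical parameters `p`, `rate1` and `rate2`; the abstract does not state the number of components or the parameter names. It only extracts gaps and evaluates the mixture likelihood for fixed parameter values, and it does not perform the Bayesian parameter estimation described above.

```python
import math

def gaps(tokens, term):
    """Distances (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]

def mixture_pdf(x, p, rate1, rate2):
    """Density at gap x of an assumed two-component exponential mixture.

    p is the weight of the first component; rate1 and rate2 are the
    exponential rates. All three names are illustrative assumptions.
    """
    return (p * rate1 * math.exp(-rate1 * x)
            + (1 - p) * rate2 * math.exp(-rate2 * x))

def log_likelihood(gap_list, p, rate1, rate2):
    """Log-likelihood of the observed gaps under fixed mixture parameters."""
    return sum(math.log(mixture_pdf(g, p, rate1, rate2)) for g in gap_list)

# Toy document: "the" occurs at token positions 0, 4 and 7.
tokens = "the cat sat on the mat and the dog sat".split()
term_gaps = gaps(tokens, "the")
```

In a sketch like this, one component with a small rate could stand for long gaps (how rarely the term re-occurs) and one with a large rate for short gaps (bursts of repeated use), but that reading is an interpretation of the abstract rather than something it states.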