A Bayesian mixture model for term re-occurrence and burstiness

Authors:
Avik Sarkar;Paul H. Garthwaite;Anne De Roeck
Affiliations:
The Open University, Milton Keynes, UK;The Open University, Milton Keynes, UK;The Open University, Milton Keynes, UK
Venue:
CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning
Year:
2005

Citing 6

A new method of weighting query terms for ad-hoc retrieval
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval

Foundations of statistical natural language processing
Foundations of statistical natural language processing

Distribution of content words and phrases in text and language modelling
Natural Language Engineering

Independence assumptions considered harmful
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics

Empirical estimates of adaptation: the chance of two Noriegas is closer to p/2 than p²
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Empirical term weighting and expansion frequency
EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13

Cited 4

Statistical properties of inter-arrival times distribution in social tagging systems
Proceedings of the 20th ACM conference on Hypertext and hypermedia

Terminology mining in social media
Proceedings of the 18th ACM conference on Information and knowledge management

Statistical simulation and the distribution of distances between identical elements in a random sequence
Computational Statistics & Data Analysis

Language technology for elearning
EC-TEL'06 Proceedings of the First European conference on Technology Enhanced Learning: innovative Approaches for Learning and Knowledge Sharing

Abstract

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term's re-occurrence rate and within-document burstiness. The model works for all kinds of terms, whether rare content words, medium-frequency terms, or frequent function words. A measure is proposed to account for a term's importance based on its distribution pattern in the corpus.
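
The gap-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a two-component exponential mixture, and it fits the parameters with plain EM as a simple stand-in for the Bayesian estimation the paper describes. The function names (`term_gaps`, `mixture_pdf`, `fit_mixture_em`) and the starting values are hypothetical choices made for illustration only.

```python
import math


def term_gaps(tokens, term):
    """Return the gaps (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]


def mixture_pdf(x, p, lam1, lam2):
    """Density of a two-component exponential mixture at gap length x."""
    return p * lam1 * math.exp(-lam1 * x) + (1 - p) * lam2 * math.exp(-lam2 * x)


def fit_mixture_em(gaps, iters=200):
    """Fit (p, lam1, lam2) by EM.

    Assumption: EM replaces the paper's Bayesian estimation here.
    Component 1 starts with the larger rate, so it captures short gaps.
    """
    n = len(gaps)
    mean = sum(gaps) / n
    p, lam1, lam2 = 0.5, 2.0 / mean, 0.5 / mean  # crude starting values
    for _ in range(iters):
        # E-step: responsibility of component 1 for each gap.
        resp = []
        for x in gaps:
            a = p * lam1 * math.exp(-lam1 * x)
            b = (1 - p) * lam2 * math.exp(-lam2 * x)
            resp.append(a / (a + b))
        # M-step: re-estimate the mixing weight and the two rates.
        r_sum = sum(resp)
        p = r_sum / n
        lam1 = r_sum / sum(r * x for r, x in zip(resp, gaps))
        lam2 = (n - r_sum) / sum((1 - r) * x for r, x in zip(resp, gaps))
    return p, lam1, lam2
```

In this sketch, a term whose gaps are mostly short with occasional very long ones would tend to get one large rate and one small rate, which is one way to read the re-occurrence and burstiness measures mentioned above. Which component corresponds to which measure in the paper is not stated in the abstract, so the labels here are assumptions.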