N-Poisson document modelling

  • Authors:
  • Eugene L. Margulis

  • Affiliations:
  • Institut für Informationssysteme, ETH Zentrum, CH-8092 Zürich, Switzerland

  • Venue:
  • SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 1992

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper is a report of a study investigating the validity of the Multiple Poisson (nP) model of word distribution in document collections. An nP distribution is a mixture of n Poisson distributions with different means. We describe a practical algorithm for determining if a certain word is distributed acording to an nP distribution and computing the distribution parameters. The algorithm was applied to every word in four different document collections. It was found that over 70% of frequently occurring words and terms indeed behave according to the nP distributions. The results indicate that the proportion of nP words depends on the collection size, document length and the frequency of the individual words. Most of the nP words recognised are distributed according to the mixture of relatively few single Poisson distributions (two, three or four). There is an indication that the number of single Poisson components in the mixture of relatively few single Poisson distributions (two, three or four). There is an indication that the number of single Poisson components in the mixture depends on the collection frequency of words.