N-Poisson document modelling

Authors:
Eugene L. Margulis
Affiliations:
Institut für Informationssysteme, ETH Zentrum, CH-8092 Zürich, Switzerland
Venue:
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
1992

Citing 3

Polychannel systems for mass digital communications
Communications of the ACM

On generalizing the Two-Poisson model
Journal of the American Society for Information Science

Probabilistic models of indexing and searching
SIGIR '80 Proceedings of the 3rd annual ACM conference on Research and development in information retrieval

Cited 10

“Is this document relevant?…probably”: a survey of probabilistic models in information retrieval
ACM Computing Surveys (CSUR)

Probabilistic models of information retrieval based on measuring the divergence from randomness
ACM Transactions on Information Systems (TOIS)

A frequency-based and a poisson-based definition of the probability of being informative
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval

A parallel derivation of probabilistic information retrieval models
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Interpreting TF-IDF term weights as making relevance decisions
ACM Transactions on Information Systems (TOIS)

Part of Speech Based Term Weighting for Information Retrieval
ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval

A probabilistic framework for automatic term recognition
Intelligent Data Analysis

Aggregative query generation
ICME'09 Proceedings of the 2009 IEEE international conference on Multimedia and Expo

Feature subspace selection for efficient video retrieval
MMM'10 Proceedings of the 16th international conference on Advances in Multimedia Modeling

A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
Information Retrieval

Abstract

This paper reports a study investigating the validity of the Multiple Poisson (nP) model of word distribution in document collections. An nP distribution is a mixture of n Poisson distributions with different means. We describe a practical algorithm for determining whether a given word is distributed according to an nP distribution and for computing the distribution parameters. The algorithm was applied to every word in four different document collections. It was found that over 70% of frequently occurring words and terms indeed behave according to the nP distributions. The results indicate that the proportion of nP words depends on the collection size, the document length and the frequency of the individual words. Most of the nP words recognised are distributed according to a mixture of relatively few single Poisson distributions (two, three or four). There is an indication that the number of single Poisson components in the mixture depends on the collection frequency of words.
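
As an illustration of what "a mixture of n Poisson distributions" means in practice, the sketch below evaluates an nP probability and estimates the mixture parameters from a word's within-document counts. The abstract does not say how the paper's algorithm computes the parameters, so expectation-maximisation (EM) is used here purely as a stand-in, and it does not include the step that decides whether a word is nP at all (that would need a separate goodness-of-fit test). All function names and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def mixture_pmf(k, weights, means):
    """nP probability of k occurrences: sum_i w_i * exp(-m_i) * m_i**k / k!."""
    return sum(w * poisson_pmf(k, m) for w, m in zip(weights, means))


def fit_np_em(counts, n, iters=200):
    """Estimate weights and means of an n-component Poisson mixture by EM.

    Assumption: EM is a generic substitute, not the paper's own procedure.
    `counts` holds one within-document frequency per document.
    """
    hist = Counter(counts)  # count value -> number of documents
    lo, hi = min(counts), max(counts)
    # Spread the initial means over the observed range, kept strictly positive.
    means = [max(lo + (i + 0.5) * (hi - lo) / n, 1e-3) for i in range(n)]
    weights = [1.0 / n] * n
    for _ in range(iters):
        sum_r = [0.0] * n   # expected number of documents per component
        sum_rk = [0.0] * n  # expected total occurrences per component
        # E-step: share each count value among the components.
        for k, docs in hist.items():
            joint = [w * poisson_pmf(k, m) for w, m in zip(weights, means)]
            total = sum(joint)
            if total == 0.0:
                continue  # numerically impossible under every component
            for i in range(n):
                r = docs * joint[i] / total
                sum_r[i] += r
                sum_rk[i] += r * k
        # M-step: re-estimate mixing weights and component means.
        weights = [s / len(counts) for s in sum_r]
        means = [max(sum_rk[i] / sum_r[i], 1e-3) if sum_r[i] > 0 else means[i]
                 for i in range(n)]
    return weights, means


def sample_poisson(lam, rng):
    """Draw one Poisson variate (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


if __name__ == "__main__":
    # Synthetic 2P word: most documents mention it rarely, a minority often.
    rng = random.Random(0)
    counts = [sample_poisson(0.2 if rng.random() < 0.7 else 5.0, rng)
              for _ in range(2000)]
    weights, means = fit_np_em(counts, n=2)
    print("weights:", weights)
    print("means:", means)
```

The fit works on the histogram of count values rather than on individual documents, so its cost per iteration depends on the number of distinct counts, not on the collection size. Fitting with n = 2, 3 or 4 and comparing the fitted probabilities with the observed histogram would be one way to explore the small component counts the abstract reports.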