On the interrelationship of dictionary size and completeness

  • Authors:
  • H. Hüther

  • Affiliations:
  • Galgenbergstr. 13, D-6654 Kirkel-Limbach, West Germany

  • Venue:
  • SIGIR '90 Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 1989

Abstract

When dictionaries for specific applications or subject fields are derived from a text collection, the frequency distribution of the terms in the collection gives information about the expected completeness of the dictionary. If only a subset of the terms in the collection is to be included in the dictionary, the completeness of the dictionary can be optimized with respect to dictionary size.

In this paper, formulas for the relationship between the frequency distribution of the terms in the collection and expected dictionary completeness are derived. First we regard one-dimensional dictionaries, where the (non-trivial) terms occurring in the texts are to be included in the dictionary. Then we describe the case of two-dimensional dictionaries, which are needed, for example, for automatic indexing with a controlled vocabulary; here relationships between text terms and descriptors from the prescribed vocabulary have to be stored in the dictionary. For both cases, formulas for the interpolation and extrapolation with respect to different collection sizes are derived.

We give experimental results for one-dimensional dictionaries and show how the completeness can be estimated and optimized.
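
The trade-off between dictionary size and completeness in the one-dimensional case can be illustrated with a minimal sketch. It does not reproduce the paper's formulas, which the abstract does not state. It assumes that completeness is approximated by the share of token occurrences in the collection whose term is in the dictionary, and that the dictionary holds the k most frequent terms. The function names `coverage_by_size` and `smallest_size_for` and the toy token list are hypothetical.

```python
from collections import Counter


def coverage_by_size(tokens):
    """Entry k-1 is the share of token occurrences covered by a dictionary
    that holds the k most frequent terms (assumed completeness proxy)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    curve = []
    # Add terms in order of decreasing frequency, accumulating their counts.
    for _, freq in counts.most_common():
        covered += freq
        curve.append(covered / total)
    return curve


def smallest_size_for(curve, target):
    """Smallest dictionary size whose coverage reaches `target`, else None."""
    for k, cov in enumerate(curve, start=1):
        if cov >= target:
            return k
    return None


# Toy collection, for illustration only.
tokens = "a a a a b b b c c d".split()
curve = coverage_by_size(tokens)
size_for_80_percent = smallest_size_for(curve, 0.8)
```

Coverage measured on the same collection that produced the dictionary tends to overstate completeness on unseen text, since new text will contain terms the collection never showed. The abstract's expected completeness, and its interpolation and extrapolation across collection sizes, address that gap.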