Scalable, statistical storage allocation for extensible inverted file construction

  • Authors:
  • Robert W. P. Luk

  • Affiliations:
  • Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

  • Venue:
  • Journal of Systems and Software
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

An Inverted file is a commonly used index for both archival databases and free text where no updates are expected. Applications like information filtering and dynamic environments like the Internet require inverted files to be updated efficiently. Recently, extensible inverted files are proposed which can be used for fast online indexing. The effective storage allocation scheme for such inverted files uses the arrival rate to preallocate storage. In this article, this storage allocation scheme is improved by using information about both the arrival rates and their variability to predict the storage needed, as well as scaling the storage allocation by a logarithmic factor. The resultant, final storage utilization rate can be as high as 97-98% after indexing about 1.6million documents. This compares favorably with the storage utilization rate of the original arrival rate storage allocation scheme. Our evaluation shows that the retrieval time for extensible inverted file on solid state disk is on average similar to the retrieval time for in-memory extensible inverted file. When file seek time is not an issue, our scalable storage allocation enables extensible inverted files to be used as the main index on disk. Our statistical storage allocation may be applicable to novel situations where the arrival of items follows a binomial, Poisson or normal distribution.