Scalable, statistical storage allocation for extensible inverted file construction

Authors: Robert W. P. Luk
Affiliations: Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Venue: Journal of Systems and Software
Year: 2011

Abstract

An inverted file is a commonly used index for both archival databases and free text where no updates are expected. Applications like information filtering and dynamic environments like the Internet require inverted files to be updated efficiently. Extensible inverted files, which can be used for fast online indexing, have recently been proposed. The effective storage allocation scheme for such inverted files uses the arrival rate to preallocate storage. In this article, this storage allocation scheme is improved by using information about both the arrival rates and their variability to predict the storage needed, and by scaling the storage allocation by a logarithmic factor. The resulting final storage utilization rate can be as high as 97-98% after indexing about 1.6 million documents. This compares favorably with the storage utilization rate of the original arrival rate storage allocation scheme. Our evaluation shows that the retrieval time for an extensible inverted file on a solid-state disk is, on average, similar to the retrieval time for an in-memory extensible inverted file. When file seek time is not an issue, our scalable storage allocation enables extensible inverted files to be used as the main index on disk. Our statistical storage allocation may be applicable to novel situations where the arrival of items follows a binomial, Poisson or normal distribution.
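
The abstract does not state the allocation formula, so the following is only a minimal sketch of the general idea of preallocating from both an arrival rate and its variability. It assumes that postings for a term arrive binomially (each new document contains the term with probability equal to the observed arrival rate) and reserves the expected count plus a multiple of the standard deviation. The names `allocate_slots`, `horizon_docs`, `z`, and `scale` are hypothetical; `scale` is a placeholder for the logarithmic scaling factor, whose exact form is not given here.

```python
import math


def allocate_slots(postings_so_far: int, docs_so_far: int,
                   horizon_docs: int, z: float = 1.0,
                   scale: float = 1.0) -> int:
    """Estimate how many posting slots to preallocate for one term.

    Assumption (not taken from the paper): arrivals are binomial, so over
    the next `horizon_docs` documents the count of new postings has
    mean n*p and standard deviation sqrt(n*p*(1-p)).
    """
    if docs_so_far <= 0 or horizon_docs < 0:
        raise ValueError("docs_so_far must be positive, horizon_docs non-negative")

    # Observed arrival rate: fraction of documents that contained the term.
    rate = min(1.0, max(0.0, postings_so_far / docs_so_far))

    mean = horizon_docs * rate
    std = math.sqrt(horizon_docs * rate * (1.0 - rate))

    # Reserve the mean plus z standard deviations; `scale` is a hypothetical
    # hook for an additional scaling factor supplied by the caller.
    predicted = scale * (mean + z * std)

    # Always reserve at least one slot so the list can grow.
    return max(1, math.ceil(predicted))


if __name__ == "__main__":
    # A term seen in 100 of 1000 documents, planning for 1000 more documents.
    print(allocate_slots(100, 1000, 1000, z=2.0))  # 119
```

A larger `z` trades storage utilization for fewer reallocations: rare terms with low variance receive little slack, while frequent, bursty terms receive more.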