Fast query evaluation through document identifier assignment for inverted file-based information retrieval systems

Authors:
Cher-Sheng Cheng;Chung-Ping Chung;Jean Jyh-Jiun Shann
Affiliations:
Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu 30050, Taiwan, ROC;Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu 30050, Taiwan, ROC;Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu 30050, Taiwan, ROC
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

Compressing an inverted file can greatly improve the query performance of an information retrieval system (IRS) by reducing disk I/O. We observe that a good document identifier assignment (DIA) clusters the document identifiers in the posting lists more tightly, which yields both better compression and shorter query processing time. In this paper, we tackle the NP-complete problem of finding an optimal DIA that minimizes the average query processing time in an IRS when the probability distribution of query terms is given. We show that the greedy nearest neighbor (Greedy-NN) algorithm performs very well on this problem. However, Greedy-NN is unsuitable for large-scale IRSs because of its high complexity, O(N^2 × n), where N denotes the number of documents and n the number of distinct terms. In real-world IRSs, the distribution of query terms is skewed. Based on this fact, we propose a fast O(N × n) heuristic, the partition-based document identifier assignment (PBDIA) algorithm, which efficiently assigns consecutive document identifiers to the documents containing frequently used query terms and thereby improves the compression efficiency of the posting lists for those terms. This in turn reduces query processing time. The experimental results show that PBDIA performs competitively with Greedy-NN on the DIA problem, and that solving this optimization problem brings significant advantages for both long queries and parallel information retrieval (IR).
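
The link between clustered identifiers and compression can be illustrated with a small sketch. It assumes, as is common for inverted files but not stated in the abstract, that each posting list is stored as gaps between consecutive document identifiers (d-gaps) and that each gap is coded with the Elias gamma code. The function names and the two sample posting lists are hypothetical and are not taken from the paper.

```python
def d_gaps(doc_ids):
    """Turn a sorted posting list into gaps between consecutive identifiers."""
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(x):
    """Length in bits of the Elias gamma code for a positive integer x."""
    return 2 * (x.bit_length() - 1) + 1


def posting_list_bits(doc_ids):
    """Total gamma-coded size of a posting list, in bits (assumed coding scheme)."""
    return sum(gamma_bits(g) for g in d_gaps(sorted(doc_ids)))


# Hypothetical posting lists for the same term under two identifier assignments.
scattered = [3, 19, 42, 77, 101, 150]   # identifiers spread across the collection
clustered = [10, 11, 12, 13, 14, 15]    # consecutive identifiers

print(posting_list_bits(scattered))  # 52
print(posting_list_bits(clustered))  # 12
```

Under these assumptions, the same six postings take 52 bits when the identifiers are scattered and 12 bits when they are consecutive, because small gaps receive short codes. An assignment that gives consecutive identifiers to documents sharing frequently queried terms shortens exactly the lists that queries read most often.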