Entry Pairing in Inverted File

Authors:
Hoang Thanh Lam;Raffaele Perego;Nguyen Thoi Quan;Fabrizio Silvestri
Affiliations:
Dip. di Informatica, Università di Pisa, Italy;ISTI-CNR, Pisa, Italy;Lomonosov Moscow State University, Russia;ISTI-CNR, Pisa, Italy
Venue:
WISE '09 Proceedings of the 10th International Conference on Web Information Systems Engineering
Year:
2009

Abstract

This paper proposes exploiting content and usage information to rearrange the inverted index of a full-text IR system. The idea is to merge the entries of two terms that frequently co-occur, either in the collection or in the answered queries, into a single paired entry. Since postings common to the paired terms are not replicated, the resulting index is more compact. In addition, queries containing terms that have been paired are answered faster, since the pre-computed posting intersection can be used directly. To choose which terms to pair, we formulate term pairing as a Maximum-Weight Matching problem on a graph, and we evaluate the efficiency and efficacy of both an exact and a heuristic solution in our scenario. We apply our technique (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list cache. Experiments showed that in the first case our approach can improve the compression ratio by up to 7.7%, while in the second we measured a saving of 12% to 18% in the size of the posting cache.
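The pairing idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the toy index, the helper names `pair_entry` and `greedy_matching`, the choice of the number of shared postings as the edge weight, and the heaviest-edge-first greedy rule standing in for the heuristic solution (the abstract does not say which heuristic is used). Terms play the role of graph nodes, and each selected edge becomes one paired entry that stores the common postings only once.

```python
from itertools import combinations


def pair_entry(postings_a, postings_b):
    """Split two posting lists into (only_a, only_b, common).

    The common part is stored once, which is where the space saving
    of a paired entry comes from.
    """
    a, b = set(postings_a), set(postings_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


def greedy_matching(weights):
    """Pick disjoint term pairs, heaviest edge first.

    `weights` maps (term_u, term_v) -> weight. This greedy rule is an
    illustrative stand-in for a heuristic matching, not an exact
    maximum-weight matching.
    """
    matched, pairs = set(), []
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), w in ordered:
        if w > 0 and u not in matched and v not in matched:
            pairs.append((u, v))
            matched.update((u, v))
    return pairs


# Hypothetical toy index: term -> sorted list of document identifiers.
index = {
    "new": [1, 2, 3, 5, 8],
    "york": [2, 3, 5, 9],
    "pisa": [4, 7, 9],
    "tower": [4, 7, 8],
}

# Assumed edge weight: postings shared by the two terms, i.e. the
# number of postings saved if the two entries are merged.
weights = {}
for u, v in combinations(sorted(index), 2):
    shared = len(set(index[u]) & set(index[v]))
    if shared:
        weights[(u, v)] = shared

pairs = greedy_matching(weights)
paired_index = {p: pair_entry(index[p[0]], index[p[1]]) for p in pairs}

# Compare the number of stored postings before and after pairing.
paired_terms = {t for p in pairs for t in p}
original_size = sum(len(pl) for pl in index.values())
paired_size = sum(len(a) + len(b) + len(c) for a, b, c in paired_index.values())
paired_size += sum(len(index[t]) for t in index if t not in paired_terms)

print(pairs)
print(original_size, paired_size)
```

In this toy setting a query containing both terms of a pair can read the stored common list directly instead of intersecting two posting lists at query time. An exact solution would replace `greedy_matching` with a maximum-weight matching algorithm, and weights derived from query logs could replace or complement the collection-based weight assumed here.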