Assigning document identifiers to enhance compressibility of Web Search Engines indexes

Authors:
Fabrizio Silvestri;Raffaele Perego;Salvatore Orlando
Affiliations:
University of Pisa - Italy;Information Science and Technology Institute (CNR), Pisa - Italy;University of Venice - Italy
Venue:
Proceedings of the 2004 ACM symposium on Applied computing
Year:
2004

Abstract

Providing efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes in which the posting lists are stored as sequences of d_gaps (i.e. differences between successive document identifiers) compressed with variable-length encoding methods. This paper describes the use of a lightweight clustering algorithm that assigns identifiers to documents so as to minimize the average value of the d_gaps. Simulations performed on a real dataset, the Google contest collection, show that our approach yields an IF index which is, depending on the d_gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
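
To make the role of d_gaps concrete, the following minimal Python sketch computes the d_gaps of a posting list and measures its size under a variable-byte code. The abstract does not name a specific encoding, so variable-byte is an assumption chosen only for illustration; the function names and the two sample posting lists are hypothetical, and the sketch does not reproduce the paper's clustering algorithm. It only shows why identifiers that place similar documents close together lead to smaller encoded lists.

```python
def to_dgaps(postings):
    """Turn a sorted list of document identifiers into d_gaps.

    Assumption: the first identifier is stored as a gap from 0.
    """
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def vbyte_encode(n):
    """Variable-byte code (illustrative choice, not taken from the paper).

    Each byte carries 7 payload bits; the high bit marks the final byte.
    """
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def encoded_size(postings):
    """Total number of bytes needed to store the posting list as d_gaps."""
    return sum(len(vbyte_encode(gap)) for gap in to_dgaps(postings))


# Hypothetical posting lists for the same term under two identifier assignments.
scattered = [3, 700, 41000, 900000]  # documents numbered without regard to content
clustered = [10, 11, 12, 13]         # similar documents given adjacent identifiers

print(to_dgaps(scattered), encoded_size(scattered))
print(to_dgaps(clustered), encoded_size(clustered))
```

With these sample lists the scattered assignment needs 9 bytes and the clustered one 4 bytes, because small gaps fit in a single byte under this code.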