Index Compression through Document Reordering

Authors:
Dan Blandford;Guy Blelloch
Venue:
DCC '02 Proceedings of the Data Compression Conference
Year:
2002


Abstract

An important concern in the design of search engines is the construction of an inverted index. An inverted index, also called a concordance, contains a list of documents (or posting list) for every possible search term. These posting lists are usually compressed with difference coding. Difference coding yields the best compression when the lists to be coded have high locality. Coding methods have been designed to specifically take advantage of locality in inverted indices. Here, we describe an algorithm to permute the document numbers so as to create locality in an inverted index. This is done by clustering the documents. Our algorithm, when applied to the TREC ad hoc database (disks 4 and 5), improves the performance of the best difference coding algorithm we found by fourteen percent. The improvement increases as the size of the index increases, so we expect that greater improvements would be possible on larger datasets.
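To make the link between locality and difference coding concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the abstract does not say which code was used, so the Elias gamma code is assumed here as a stand-in for "difference coding", the two-term index is a toy, and the permutation `new_id` is chosen by hand in place of the clustering step. The helper names (`gaps`, `gamma_bits`, `index_bits`, `renumber`) are hypothetical.

```python
from typing import Dict, List


def gaps(postings: List[int]) -> List[int]:
    """Difference-code a sorted posting list: first ID, then successive gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def gamma_bits(n: int) -> int:
    """Bit length of the Elias gamma code for a positive integer n.

    Assumption: gamma coding stands in for the (unnamed) difference coder.
    """
    return 2 * (n.bit_length() - 1) + 1


def index_bits(index: Dict[str, List[int]]) -> int:
    """Total gamma-coded size, in bits, of every posting list in the index."""
    return sum(
        gamma_bits(g)
        for postings in index.values()
        for g in gaps(sorted(postings))
    )


def renumber(index: Dict[str, List[int]], new_id: Dict[int, int]) -> Dict[str, List[int]]:
    """Apply a document-number permutation and re-sort each posting list."""
    return {
        term: sorted(new_id[doc] for doc in postings)
        for term, postings in index.items()
    }


# Toy index: documents are numbered from 1, and two topics are interleaved.
index = {
    "compression": [1, 3, 5, 7],
    "graph": [2, 4, 6, 8],
}

# Hand-picked permutation that places similar documents next to each other.
# In the paper this ordering would come from clustering the documents.
new_id = {1: 1, 3: 2, 5: 3, 7: 4, 2: 5, 4: 6, 6: 7, 8: 8}

bits_before = index_bits(index)
bits_after = index_bits(renumber(index, new_id))
print(bits_before, bits_after)
```

In this toy case the gaps shrink from mostly 2 to mostly 1 once similar documents receive adjacent numbers, so the coded size drops. The size of the effect on a real collection depends on the coder and on how well the clustering groups documents that share terms.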