Scalable techniques for document identifier assignment in inverted indexes

Authors:
Shuai Ding;Josh Attenberg;Torsten Suel
Affiliations:
Polytechnic Institute of NYU, NY, USA;Polytechnic Institute of NYU, NY, USA;Polytechnic Institute of NYU, NY, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

Web search engines depend on the full-text inverted index data structure. Because query processing performance depends heavily on the size of the inverted index, a large body of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to a significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These more involved approaches achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms and perform a detailed evaluation on three large data sets, showing reductions in index size.
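
To make the motivation concrete, the following is a minimal sketch of why document identifier assignment affects index size. It is not the paper's algorithm: it only compares two hand-picked orderings of a toy collection and estimates compressed size as the number of bits needed for the gaps between consecutive document IDs in each posting list. The toy documents, the function names, and the log-based cost proxy are all assumptions made for illustration.

```python
import math

def posting_lists(docs):
    """Map each term to the increasing list of document IDs that contain it."""
    index = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            index.setdefault(term, []).append(doc_id)
    return index

def gap_cost_bits(docs):
    """Sum ceil(log2(gap + 1)) over all ID gaps: a rough proxy for index size."""
    total = 0
    for ids in posting_lists(docs).values():
        prev = -1  # so that the first gap is doc_id + 1
        for doc_id in ids:
            gap = doc_id - prev
            total += math.ceil(math.log2(gap + 1))
            prev = doc_id
    return total

# Hypothetical collection: two topics, interleaved in the original ordering.
docs = [
    ["index", "compression"],
    ["football", "league"],
    ["index", "compression"],
    ["football", "league"],
    ["index", "compression"],
    ["football", "league"],
]

# Reassign identifiers so that similar documents become neighbours.
clustered = [docs[i] for i in (0, 2, 4, 1, 3, 5)]

print("interleaved:", gap_cost_bits(docs))
print("clustered:  ", gap_cost_bits(clustered))
```

In this toy setting the clustered ordering yields a lower cost than the interleaved one, because placing similar documents next to each other produces many small gaps. Finding such an ordering at scale is the problem the abstract describes; this sketch only evaluates orderings, it does not search for them.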