Web search engines depend on the full-text inverted index data structure. Because query processing performance depends heavily on the size of the inverted index, a large body of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to a significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. The more involved approaches achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms and perform a detailed evaluation on three large data sets, showing improvements in index size.
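To illustrate why document identifier assignment affects index size, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes posting lists are stored as gaps between consecutive document identifiers and coded with Elias-gamma, and it orders documents with a greedy nearest-neighbor chain over Jaccard similarity as a simple stand-in for a Traveling Salesman heuristic. It compares all pairs of documents, which is exactly the cost that does not scale; the Locality Sensitive Hashing step that would sparsify the graph is omitted. All function names and the toy collection are hypothetical.

```python
def gamma_bits(postings):
    """Bits needed to code a sorted posting list as Elias-gamma gaps."""
    total, prev = 0, -1
    for doc_id in postings:
        gap = doc_id - prev                  # always >= 1
        total += 2 * gap.bit_length() - 1    # Elias-gamma code length
        prev = doc_id
    return total


def index_bits(docs, order):
    """Total posting-list size when documents get IDs 0..n-1 in `order`."""
    ids = {name: i for i, name in enumerate(order)}
    postings = {}
    for name, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(ids[name])
    return sum(gamma_bits(sorted(p)) for p in postings.values())


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_order(docs):
    """Chain documents so that each one follows its most similar predecessor.

    Assumption: a nearest-neighbor tour over the complete similarity graph,
    used here only as a toy replacement for a real TSP heuristic.
    """
    remaining = sorted(docs)                 # deterministic tie-breaking
    order = [remaining.pop(0)]
    while remaining:
        last = docs[order[-1]]
        nxt = max(remaining, key=lambda n: jaccard(last, docs[n]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical toy collection: two pairs of similar documents.
docs = {
    "a": {"x", "y"},
    "b": {"p", "q"},
    "c": {"x", "y", "z"},
    "d": {"p", "q", "r"},
}

interleaved = ["a", "b", "c", "d"]
clustered = greedy_order(docs)
print(interleaved, index_bits(docs, interleaved))
print(clustered, index_bits(docs, clustered))
```

Placing similar documents next to each other makes the gaps within each posting list smaller, so the gap code gets shorter; on this toy collection the clustered order needs fewer bits than the interleaved one.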