Sorting out the document identifier assignment problem

Authors:
Fabrizio Silvestri
Affiliations:
Institute for Information Science and Technologies, ISTI, CNR, Pisa, Italy
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007


Abstract

The compression of inverted file indexes in Web search engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this claim by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.
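
The idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names (`assign_ids_by_url`, `d_gaps`, `vbyte_size`, `index_size`) are hypothetical, and variable-byte coding is assumed here only as one simple example of a gap-based compression scheme. The sketch sorts URLs lexicographically, uses each URL's rank as its document identifier, and measures how many bytes the posting lists need once they are stored as gaps between consecutive identifiers.

```python
from typing import Dict, Iterable, List, Set


def assign_ids_by_url(urls: Iterable[str]) -> Dict[str, int]:
    """Give each URL an identifier equal to its rank in lexicographic order."""
    return {url: rank for rank, url in enumerate(sorted(urls))}


def d_gaps(postings: List[int]) -> List[int]:
    """Convert a sorted posting list into gaps between consecutive identifiers.

    The first gap is measured from -1, so every gap is at least 1.
    """
    gaps, prev = [], -1
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def vbyte_size(gaps: Iterable[int]) -> int:
    """Bytes needed by variable-byte coding (7 payload bits per byte).

    Assumption: variable-byte is used purely for illustration.
    """
    return sum(max(1, (g.bit_length() + 6) // 7) for g in gaps)


def index_size(doc_terms: Dict[str, Set[str]], ids: Dict[str, int]) -> int:
    """Total encoded size of all posting lists under a given identifier assignment."""
    postings: Dict[str, List[int]] = {}
    for url, terms in doc_terms.items():
        for term in terms:
            postings.setdefault(term, []).append(ids[url])
    return sum(vbyte_size(d_gaps(sorted(p))) for p in postings.values())
```

To compare two assignments, one could call `index_size` once with the mapping returned by `assign_ids_by_url` and once with, say, a random permutation of the same identifiers. Pages that share a URL prefix often share vocabulary, so the URL-sorted assignment tends to produce smaller gaps, which is the effect the abstract reports on a much larger scale.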