TSP and cluster-based solutions to the reassignment of document identifiers

Authors:
Roi Blanco;Álvaro Barreiro
Affiliations:
IRLab. Computer Science Department, University of Corunna, Spain;IRLab. Computer Science Department, University of Corunna, Spain
Venue:
Information Retrieval
Year:
2006

Citing 0
Cited 9

Compact full-text indexing of versioned document collections
Proceedings of the 18th ACM conference on Information and knowledge management

Entry Pairing in Inverted File
WISE '09 Proceedings of the 10th International Conference on Web Information Systems Engineering

Scalable techniques for document identifier assignment in inverted indexes
Proceedings of the 19th international conference on World wide web

Engineering basic algorithms of an in-memory text search engine
ACM Transactions on Information Systems (TOIS)

Inverted index compression via online document routing
Proceedings of the 20th international conference on World wide web

Faster top-k document retrieval using block-max indexes
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Quasi-succinct indices
Proceedings of the sixth ACM international conference on Web search and data mining

Bitlist: new full-text index for low space cost and efficient keyword search
Proceedings of the VLDB Endowment

Using rating matrix compression techniques to speed up collaborative recommendations
Information Retrieval

Abstract

Recent studies have demonstrated that Inverted File (IF) sizes can be reduced by reassigning the document identifiers of the original collection, because this lowers the distance between the positions of documents related to a single term. Variable-bit encoding schemes can exploit the reduced average gap and decrease the total number of bits per document pointer. This paper presents an efficient solution to the reassignment problem, which consists of reducing the dimensionality of the input data using an SVD transformation and then treating the reassignment as a Travelling Salesman Problem (TSP). We also present some efficient solutions based on clustering. Finally, we combine the TSP and clustering strategies for reordering the document identifiers. We present experimental tests and performance results on two TREC text collections, obtaining good compression ratios with low running times, and advance the possibility of obtaining scalable solutions for web collections based on the techniques presented here.
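
The effect described in the abstract can be illustrated with a toy sketch. It is not the authors' method: it skips the SVD step, assumes Jaccard similarity between raw term sets as the edge weight, uses a simple nearest-neighbour heuristic as the TSP approximation, and assumes Elias gamma as the variable-bit code. All function names and the four-document collection are hypothetical.

```python
def gamma_bits(gap):
    # Length in bits of the Elias gamma code for a positive integer
    # (assumed here as the variable-bit scheme): 2*floor(log2(gap)) + 1.
    return 2 * (gap.bit_length() - 1) + 1


def index_bits(docs, order):
    # Total bits needed to gamma-code the gaps of every posting list
    # after documents are renumbered 1..n following `order`.
    new_id = {doc: i + 1 for i, doc in enumerate(order)}
    postings = {}
    for doc, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(new_id[doc])
    total = 0
    for ids in postings.values():
        prev = 0
        for doc_id in sorted(ids):
            total += gamma_bits(doc_id - prev)  # encode the gap, not the id
            prev = doc_id
    return total


def jaccard(a, b):
    # Similarity between two non-empty term sets (assumed edge weight).
    return len(a & b) / len(a | b)


def greedy_tsp_order(docs):
    # Nearest-neighbour TSP heuristic: start at the first document and
    # always move to the most similar unvisited one.
    remaining = list(docs)
    order = [remaining.pop(0)]
    while remaining:
        last_terms = docs[order[-1]]
        nxt = max(remaining, key=lambda d: jaccard(last_terms, docs[d]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical collection: two topics interleaved in the original order.
docs = {
    "d1": {"apple", "pear"},
    "d2": {"car", "road"},
    "d3": {"apple", "pear", "plum"},
    "d4": {"car", "road", "fuel"},
}
original = list(docs)
reordered = greedy_tsp_order(docs)

print(reordered)                    # ['d1', 'd3', 'd2', 'd4']
print(index_bits(docs, original))   # 28
print(index_bits(docs, reordered))  # 20
```

Placing similar documents next to each other shrinks the gaps in the posting lists of their shared terms, so the same index costs fewer bits in this toy case. A real collection would need the dimensionality reduction and the more scalable strategies the paper describes.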