Incremental cluster-based retrieval using compressed cluster-skipping inverted files

Authors:
Ismail Sengor Altingovde;Engin Demir;Fazli Can;Özgür Ulusoy
Affiliations:
Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2008

Abstract

We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.
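
The query evaluation loop described in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the index layout (`Index`, `Block`), the function name `incremental_cbr`, and its parameters are hypothetical names chosen for illustration. The sketch also assumes that cluster scores are updated from a per-block centroid weight before the best clusters are chosen for each term, and it ignores compression entirely. In a compressed index, the benefit of skipping a block would be that its postings never need to be decoded.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: each term maps to a posting list that is split into
# cluster blocks. A block holds the cluster id, the weight of the term in that
# cluster's centroid, and the (doc_id, term_weight) postings of its members.
Block = Tuple[int, float, List[Tuple[int, float]]]
Index = Dict[str, List[Block]]


def incremental_cbr(
    index: Index,
    query: Dict[str, float],
    num_best_clusters: int,
    top_k: int,
) -> List[Tuple[int, float]]:
    """Rank documents while re-selecting the best clusters after every term."""
    cluster_acc: Dict[int, float] = defaultdict(float)
    doc_acc: Dict[int, float] = defaultdict(float)

    for term, q_weight in query.items():
        blocks = index.get(term, [])  # one posting-list lookup per query term

        # Update cluster scores from the centroid weights stored with the blocks.
        for cluster_id, centroid_weight, _ in blocks:
            cluster_acc[cluster_id] += q_weight * centroid_weight

        # Recompute the best clusters; the set can change from term to term.
        best = set(
            sorted(cluster_acc, key=lambda c: (-cluster_acc[c], c))[:num_best_clusters]
        )

        # Score documents only in blocks of the current best clusters.
        for cluster_id, _, postings in blocks:
            if cluster_id not in best:
                continue  # skipped block: its postings are never touched
            for doc_id, d_weight in postings:
                doc_acc[doc_id] += q_weight * d_weight

    # Simplification (an assumption of this sketch): documents scored under an
    # earlier best-cluster set are kept even if their cluster later drops out.
    ranked = sorted(doc_acc.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Called with a small index, a query given as a term-to-weight mapping, and `num_best_clusters=1`, the function scores only the postings that belong to whichever cluster leads after each term, which mirrors the "only relevant portions of the posting lists" behaviour the abstract describes.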