Pruning policies for two-tiered inverted index with correctness guarantee

Authors:
Alexandros Ntoulas; Junghoo Cho
Affiliations:
Microsoft; UCLA
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

Web search engines maintain large-scale inverted indexes that users query thousands of times per second. To cope with this query load, search engines prune their index to keep only the documents that are likely to be returned as top results, and use the pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, computing the top results only from the pruned index may significantly degrade result quality: a document that should be in the top results but was not included in the pruned index will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how to avoid any degradation of result quality due to the pruning-based performance optimization while still realizing most of its benefit. Our contribution is a number of modifications to the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top of the search results, even though the first batch is computed from the pruned index most of the time. We also show how to determine the optimal size of a pruned index, and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
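
The abstract does not spell out how the correctness guarantee is obtained, so the following is only a minimal sketch of one way a two-tier lookup with such a guarantee could work, not the authors' algorithm. It assumes a hypothetical additive scoring model, a per-term score threshold `tau` (every posting scoring below `tau[t]` is pruned, and every `tau[t]` is positive), and k >= 1. All names (`build_pruned`, `top_k_pruned`, `answer`) and the toy data are illustrative assumptions. The pruned tier answers a query only when it can certify that no other document could outscore its top k; otherwise the query falls back to the full index.

```python
from typing import Dict, List, Optional, Tuple

Index = Dict[str, Dict[str, float]]  # term -> {doc_id: term score}
Ranking = List[Tuple[str, float]]


def build_pruned(full: Index, tau: Dict[str, float]) -> Index:
    """Keep only postings whose score reaches the per-term threshold (assumed policy)."""
    return {t: {d: s for d, s in plist.items() if s >= tau[t]}
            for t, plist in full.items()}


def top_k_full(full: Index, terms: List[str], k: int) -> Ranking:
    """Reference answer: sum term scores over the complete index."""
    scores: Dict[str, float] = {}
    for t in terms:
        for d, s in full.get(t, {}).items():
            scores[d] = scores.get(d, 0.0) + s
    return sorted(scores.items(), key=lambda x: (-x[1], x[0]))[:k]


def top_k_pruned(pruned: Index, tau: Dict[str, float],
                 terms: List[str], k: int) -> Optional[Ranking]:
    """Return the top k from the pruned index, or None if it cannot be certified."""
    exact: Dict[str, float] = {}
    # A document absent from every pruned list scores below the sum of thresholds.
    best_other = sum(tau[t] for t in terms)
    docs = set().union(*(pruned.get(t, {}) for t in terms))
    for d in docs:
        score, bound, complete = 0.0, 0.0, True
        for t in terms:
            s = pruned.get(t, {}).get(d)
            if s is None:            # posting pruned or absent: score is below tau[t]
                complete = False
                bound += tau[t]
            else:
                score += s
                bound += s
        if complete:
            exact[d] = score         # all postings present, so the score is exact
        else:
            best_other = max(best_other, bound)
    ranked = sorted(exact.items(), key=lambda x: (-x[1], x[0]))[:k]
    # Certified only if k exact scores exist and none of the other bounds exceeds them.
    if len(ranked) == k and ranked[-1][1] >= best_other:
        return ranked
    return None


def answer(full: Index, pruned: Index, tau: Dict[str, float],
           terms: List[str], k: int) -> Tuple[Ranking, str]:
    """Try the pruned tier first; fall back to the full index when not certified."""
    result = top_k_pruned(pruned, tau, terms, k)
    if result is not None:
        return result, "pruned"
    return top_k_full(full, terms, k), "full"


# Toy data, invented purely for illustration.
full_index: Index = {
    "web": {"d1": 0.9, "d2": 0.8, "d3": 0.2, "d4": 0.1},
    "search": {"d1": 0.7, "d2": 0.6, "d3": 0.1, "d5": 0.3},
}
thresholds = {"web": 0.5, "search": 0.5}
pruned_index = build_pruned(full_index, thresholds)

top2, tier2 = answer(full_index, pruned_index, thresholds, ["web", "search"], 2)
top3, tier3 = answer(full_index, pruned_index, thresholds, ["web", "search"], 3)
```

In this toy setup the top-2 query is served by the pruned tier, because both candidates have exact scores above every bound, while the top-3 query falls back to the full index. In either case the returned ranking matches what the full index alone would produce, which is the property the abstract calls a correctness guarantee.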