Challenges in building large-scale information retrieval systems: invited talk

Authors:
Jeffrey Dean
Affiliations:
Google, Inc.
Venue:
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Year:
2009

References: 0
Cited by: 36

ROAR: increasing the flexibility and performance of distributed search
Proceedings of the ACM SIGCOMM 2009 conference on Data communication

Scalable techniques for document identifier assignment in inverted indexes
Proceedings of the 19th international conference on World wide web

On indexing error-tolerant set containment
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

From web data to entities and back
CAiSE'10 Proceedings of the 22nd international conference on Advanced information systems engineering

MapReduce for information retrieval evaluation: "let's quickly test this on 12 TB of data"
CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum

Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment

Batch query processing for web search engines
Proceedings of the fourth ACM international conference on Web search and data mining

Dremel: interactive analysis of web-scale datasets
Communications of the ACM

Cost-Aware Strategies for Query Result Caching in Web Search Engines
ACM Transactions on the Web (TWEB)

Full-text indexing for optimizing selection operations in large-scale data analytics
Proceedings of the second international workshop on MapReduce and its applications

Scalable multi-dimensional user intent identification using tree structured distributions
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Timestamp-based result cache invalidation for web search engines
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Faster top-k document retrieval using block-max indexes
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Small cache, big effect: provable load balancing for randomly partitioned cluster services
Proceedings of the 2nd ACM Symposium on Cloud Computing

Efficiently encoding term co-occurrences in inverted indexes
Proceedings of the 20th ACM international conference on Information and knowledge management

SIMD-based decoding of posting lists
Proceedings of the 20th ACM international conference on Information and knowledge management

Efficient phrase querying with flat position index
Proceedings of the 20th ACM international conference on Information and knowledge management

A five-level static cache architecture for web search engines
Information Processing and Management: an International Journal

Optimizing positional index structures for versioned document collections
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

To index or not to index: time-space trade-offs in search engines with positional ranking functions
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Learning to predict response times for online query scheduling
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Processing a trillion cells per mouse click
Proceedings of the VLDB Endowment

A short survey on the state of the art in architectures and platforms for large scale data analysis and knowledge discovery from data
Proceedings of the WICSA/ECSA 2012 Companion Volume

A distributed index for efficient parallel top-k keyword search on massive graphs
Proceedings of the twelfth international workshop on Web information and data management

Reordering an index to speed query processing without loss of effectiveness
Proceedings of the Seventeenth Australasian Document Computing Symposium

Efficient and effective retrieval using selective pruning
Proceedings of the sixth ACM international conference on Web search and data mining

Optimizing top-k document retrieval strategies for block-max indexes
Proceedings of the sixth ACM international conference on Web search and data mining

Hybrid query scheduling for a replicated search engine
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval

ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

The impact of solid state drive on search engine cache management
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Document identifier reassignment and run-length-compressed inverted indexes for improved search performance
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

A candidate filtering mechanism for fast top-k query processing on modern cpus
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Scalability and efficiency challenges in commercial web search engines
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Permutation indexing: fast approximate retrieval from large corpora
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Load-sensitive selective pruning for distributed search
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Learning to rank query suggestions for adhoc and diversity search
Information Retrieval

Abstract

Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval. In this talk I will discuss the evolution of Google's hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I will also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems. Finally, I will describe some future challenges and open research problems in this area.