I/O-conscious data preparation for large-scale web search engines

Authors:
Maxim Lifantsev;Tzi-cker Chiueh
Affiliations:
Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY;Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Citing 15
Cited 3

A system for adaptive disk rearrangement

Software—Practice & Experience
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
The connectivity server: fast access to linkage information on the Web

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient distributed algorithms to build inverted files

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Authoritative sources in a hyperlinked environment

Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
The term vector database: fast access to indexing terms for Web pages

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
WebBase: a repository of Web pages

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Building a distributed full-text index for the Web

Proceedings of the 10th international conference on World Wide Web
Searching the Web

ACM Transactions on Internet Technology (TOIT)
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
Mercator: A scalable, extensible Web crawler

World Wide Web
Trading capacity for performance in a disk array

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
I/O-Conscious Volume Rendering

EGVISSYM'01 Proceedings of the 3rd Joint Eurographics - IEEE TCVG conference on Visualization

Implementation of a modern web search engine cluster

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Just in time indexing for up to the second search

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
An update-aware storage system for low-locality update-intensive workloads

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Given that commercial search engines cover billions of web pages, efficiently managing the corresponding volumes of disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of performed disk accesses and seeks by maximizing the temporal locality of data access and organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. This technique is based on our experience from building a functionally complete and fully operational web search engine called Yuntis. As such, it is in particular well suited for most data manipulation tasks in a modern web search engine and is employed throughout Yuntis. The key idea of this technique is co-ordinated partitioning of related data tables and corresponding partitioning and delayed batched execution of the transformation and querying operations that work with the data. This data and processing partitioning is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation.