I/O-conscious data preparation for large-scale web search engines

  • Authors:
  • Maxim Lifantsev;Tzi-cker Chiueh

  • Affiliations:
  • Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY;Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY

  • Venue:
  • VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given that commercial search engines cover billions of web pages, efficiently managing the corresponding volumes of disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of performed disk accesses and seeks by maximizing the temporal locality of data access and organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. This technique is based on our experience from building a functionally complete and fully operational web search engine called Yuntis. As such, it is in particular well suited for most data manipulation tasks in a modern web search engine and is employed throughout Yuntis. The key idea of this technique is co-ordinated partitioning of related data tables and corresponding partitioning and delayed batched execution of the transformation and querying operations that work with the data. This data and processing partitioning is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation.