Commercial search engines cover billions of web pages, so efficiently managing the disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of disk accesses and seeks by maximizing the temporal locality of data access and by organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. The technique is based on our experience building a functionally complete and fully operational web search engine called Yuntis. As such, it is particularly well suited to most data manipulation tasks in a modern web search engine and is employed throughout Yuntis. The key idea is coordinated partitioning of related data tables, together with corresponding partitioning and delayed, batched execution of the transformation and querying operations that work with the data. This data and processing partitioning is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation.
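The key idea of partitioned, delayed, batched execution can be illustrated with a minimal Python sketch. It is not taken from Yuntis: the class `PartitionedTable`, its methods, the modulo partitioning rule, and the in-memory lists standing in for disk-resident partitions are all assumptions made for illustration. Operations are queued per partition instead of being applied immediately, and a later flush reads each affected partition once, applies every queued operation to it while it is in memory, and writes it back once.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy key/value table split into partitions, with batched updates.

    Hypothetical illustration only: a list of dicts stands in for
    disk-resident partitions, and `loads` counts simulated reads.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # Stand-in for disk: one dict per partition.
        self._disk = [dict() for _ in range(num_partitions)]
        # Pending operations, grouped by the partition they will touch.
        self._pending = defaultdict(list)
        self.loads = 0  # number of simulated whole-partition reads

    def _partition_of(self, key):
        # Assumed partitioning rule: integer keys, modulo partition count.
        return key % self.num_partitions

    def enqueue_add(self, key, delta):
        """Delay an 'add delta to key' operation instead of running it now."""
        self._pending[self._partition_of(key)].append((key, delta))

    def flush(self):
        """Run all delayed operations, one partition at a time."""
        for part in sorted(self._pending):
            data = dict(self._disk[part])  # one simulated sequential read
            self.loads += 1
            for key, delta in self._pending[part]:
                data[key] = data.get(key, 0) + delta
            self._disk[part] = data  # one simulated sequential write
        self._pending.clear()

    def get(self, key):
        """Look up the current stored value for key (0 if absent)."""
        return self._disk[self._partition_of(key)].get(key, 0)


table = PartitionedTable(num_partitions=4)
for key in [1, 5, 9, 2, 1, 1, 6]:
    table.enqueue_add(key, 1)
table.flush()
```

In this sketch, seven queued updates touch only two partitions, so the flush performs two partition reads rather than one access per update. A second related table partitioned by the same rule could be flushed in the same pass, which is the sense in which the partitioning is coordinated; that extension is left out here for brevity.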