Loading databases using dataflow parallelism

Authors:
Tom Barclay;Robert Barnes;Jim Gray;Prakash Sundaresan
Affiliations:
Digital Equipment Corporation, San Francisco Systems Center, Microsoft, One Microsoft Way, Redmond, WA;Digital Equipment Corporation, San Francisco Systems Center, Microsoft, One Microsoft Way, Redmond, WA;Digital Equipment Corporation, San Francisco Systems Center, 310 Filbert St., S.F., CA;Digital Equipment Corporation, San Francisco Systems Center, Informix, 921 SW Washington St. # 670, Portland, OR
Venue:
ACM SIGMOD Record
Year:
1994

Abstract

This paper describes a parallel database load prototype for Digital's Rdb database product. The prototype takes a dataflow approach to database parallelism. It includes an explorer that discovers the cluster configuration and records it in a database, a client CUI interface that gathers the load job description from the user and from the Rdb catalogs, and an optimizer that picks the best parallel execution plan and records it in a web data structure. The web describes the data operators, the dataflow rivers among them, and the binding of operators to processes, processes to processors, and files to discs and tapes. This paper describes the optimizer's cost-based hierarchical optimization strategy in some detail. The prototype executes the web's plan by spawning a web manager process at each node of the cluster. The managers create the local executor processes and orchestrate startup, phasing, checkpoint, and shutdown. Each executor process performs one or more operators. Data flows among the operators through memory-to-memory streams within a node and through web-manager multiplexed TCP/IP streams among nodes. The design of the transaction and checkpoint/restart mechanisms is also described. Preliminary measurements indicate that this design will give excellent scaleups.
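As a rough illustration of the web described in the abstract, the following Python sketch models operators, rivers, and the bindings among operators, processes, nodes, and files. It is not the prototype's actual data structure: every class, method, and field name here (`Web`, `bind`, `connect`, `transport`, the operator and node names) is a hypothetical chosen for illustration, and the sketch binds processes to nodes rather than to individual processors for brevity. The only behavior taken from the abstract is that a river between operators on the same node uses a memory stream, while a river that crosses nodes uses a TCP/IP stream.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Operator:
    name: str     # illustrative operator name, e.g. "scan" or "sort"
    process: str  # executor process this operator is bound to


@dataclass(frozen=True)
class River:
    source: str  # name of the producing operator
    sink: str    # name of the consuming operator


@dataclass
class Web:
    """Hypothetical plan record: operators, rivers, and their bindings."""
    operators: Dict[str, Operator] = field(default_factory=dict)
    rivers: List[River] = field(default_factory=list)
    process_node: Dict[str, str] = field(default_factory=dict)  # process -> node
    file_device: Dict[str, str] = field(default_factory=dict)   # file -> disc/tape

    def bind(self, op_name: str, process: str, node: str) -> None:
        # A process lives on exactly one node; reject conflicting bindings.
        existing = self.process_node.setdefault(process, node)
        if existing != node:
            raise ValueError(f"{process} is already bound to {existing}")
        self.operators[op_name] = Operator(op_name, process)

    def connect(self, source: str, sink: str) -> None:
        # Both ends of a river must be known operators.
        for name in (source, sink):
            if name not in self.operators:
                raise KeyError(name)
        self.rivers.append(River(source, sink))

    def node_of(self, op_name: str) -> str:
        return self.process_node[self.operators[op_name].process]

    def transport(self, river: River) -> str:
        # Same node: memory-to-memory stream. Different nodes: TCP/IP stream.
        same_node = self.node_of(river.source) == self.node_of(river.sink)
        return "memory" if same_node else "tcp"


web = Web()
web.bind("scan", "exec-1", "node-a")
web.bind("sort", "exec-1", "node-a")    # one process may run several operators
web.bind("insert", "exec-2", "node-b")
web.connect("scan", "sort")
web.connect("sort", "insert")
web.file_device["input.dat"] = "tape-0"  # illustrative file-to-device binding

transports = [web.transport(r) for r in web.rivers]
```

In this sketch the scan-to-sort river stays within `node-a`, while the sort-to-insert river crosses to `node-b`, so the two rivers would use different stream types. A fuller model would also record the per-node web manager that multiplexes the inter-node streams.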