Loading databases using dataflow parallelism

  • Authors:
  • Tom Barclay;Robert Barnes;Jim Gray;Prakash Sundaresan

  • Affiliations:
  • Digital Equipment Corporation, San Francisco Systems Center, Microsoft, One Microsoft Way, Redmond, WA;Digital Equipment Corporation, San Francisco Systems Center, Microsoft, One Microsoft Way, Redmond, WA;Digital Equipment Corporation, San Francisco Systems Center, 310 Filbert St., S.F., CA;Digital Equipment Corporation, San Francisco Systems Center, Informix, 921 SW Washington St. # 670, Portland, OR

  • Venue:
  • ACM SIGMOD Record
  • Year:
  • 1994

Abstract

This paper describes a parallel database load prototype for Digital's Rdb database product. The prototype takes a dataflow approach to database parallelism. It includes an explorer that discovers and records the cluster configuration in a database, a client CUI interface that gathers the load job description from the user and from the Rdb catalogs, and an optimizer that picks the best parallel execution plan and records it in a web data structure. The web describes the data operators, the dataflow rivers among them, and the binding of operators to processes, processes to processors, and files to discs and tapes. This paper describes the optimizer's cost-based hierarchical optimization strategy in some detail. The prototype executes the web's plan by spawning a web manager process at each node of the cluster. The managers create the local executor processes and orchestrate startup, phasing, checkpoint, and shutdown. Each executor process performs one or more operators. Data flows among the operators via memory-to-memory streams within a node, and via web-manager-multiplexed TCP/IP streams among nodes. The design of the transaction and checkpoint/restart mechanisms is also described. Preliminary measurements indicate that this design will give excellent scaleups.
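
To make the web structure described in the abstract more concrete, here is a minimal illustrative sketch in Python. It is not the prototype's implementation: the class names (`Operator`, `River`, `Web`), the method names, and the sample operator, process, and node names are all hypothetical, and file-to-disc bindings, costs, and checkpointing are left out. It only shows how a plan might record operators, the rivers between them, and the operator-to-process and process-to-node bindings, and how the stream type for a river could follow from those bindings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Operator:
    name: str     # hypothetical operator name, e.g. "scan"
    process: str  # the process this operator is bound to


@dataclass(frozen=True)
class River:
    source: str   # name of the producing operator
    sink: str     # name of the consuming operator


@dataclass
class Web:
    """Illustrative plan record: operators, rivers, and bindings."""
    operators: Dict[str, Operator] = field(default_factory=dict)
    rivers: List[River] = field(default_factory=list)
    process_node: Dict[str, str] = field(default_factory=dict)  # process -> node

    def add_operator(self, name: str, process: str, node: str) -> None:
        # Bind the operator to a process and the process to a node.
        self.operators[name] = Operator(name, process)
        self.process_node[process] = node

    def add_river(self, source: str, sink: str) -> None:
        self.rivers.append(River(source, sink))

    def node_of(self, operator_name: str) -> str:
        return self.process_node[self.operators[operator_name].process]

    def transport(self, river: River) -> str:
        # Same node: memory-to-memory stream; different nodes: TCP/IP stream.
        same_node = self.node_of(river.source) == self.node_of(river.sink)
        return "memory" if same_node else "tcp"

    def processes_on(self, node: str) -> List[str]:
        # The executor processes a per-node manager would need to create.
        return sorted(p for p, n in self.process_node.items() if n == node)


# Assumed three-operator load plan spread over two nodes.
web = Web()
web.add_operator("scan", "p1", "nodeA")
web.add_operator("sort", "p1", "nodeA")
web.add_operator("load", "p2", "nodeB")
web.add_river("scan", "sort")
web.add_river("sort", "load")

transports = [web.transport(r) for r in web.rivers]
print(transports)
```

In this sketch the scan-to-sort river stays within one node and is marked as a memory stream, while the sort-to-load river crosses nodes and is marked as a TCP/IP stream, mirroring the two stream types the abstract mentions.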