Exploiting programmable network interfaces for parallel query execution in workstation clusters

  • Authors:
  • V. Santhosh Kumar; M. J. Thazhuthaveetil; R. Govindarajan

  • Affiliations:
  • Supercomputer Edn. and Res. Centre, Indian Institute of Science, Bangalore, India; Supercomputer Edn. and Res. Centre, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, India; Supercomputer Edn. and Res. Centre, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, India

  • Venue:
  • IPDPS '06: Proceedings of the 20th International Conference on Parallel and Distributed Processing
  • Year:
  • 2006

Abstract

Workstation clusters equipped with high-performance interconnects that have programmable network processors offer interesting opportunities to enhance the performance of the parallel applications run on them. In this paper, we propose schemes in which certain application-level processing in parallel database query execution is performed on the network processor. Using a timed Petri net model, we evaluate the performance of TPC-H queries executing on a high-end cluster where all tuple processing is done on the host processor, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose four schemes in which certain tuple processing activity is offloaded to the network processor. The first two schemes offload the tuple splitting activity, i.e., the computation that identifies the node on which each tuple is to be processed; they achieve an execution time speedup of 1.09 relative to the base scheme, but the I/O bus becomes the bottleneck resource. The third scheme, in addition to offloading tuple processing activity, combines the disk and network interface to avoid the I/O bus bottleneck, resulting in speedups of up to 1.16, but with high host processor utilization. Our fourth scheme, in which the network processor also performs part of the join operation along with the host processor, gives a speedup of 1.47 together with balanced utilization of system resources. Further, we observe that the proposed schemes perform equally well in a scaled architecture, i.e., when the number of processors is increased from 2 to 64.
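
The "tuple splitting" step mentioned above is essentially a partitioning computation: hash a tuple's key and map it to the cluster node that should process that tuple, so that in the offloaded schemes this work can run on the network processor rather than the host processor. The C sketch below only illustrates that idea under assumed choices (an FNV-1a hash, a fixed node count, and hypothetical function names); it is not the authors' implementation.

```c
/* Minimal sketch of hash-based tuple splitting: map a tuple's
 * partitioning key to the node that should process it. The hash
 * function, node count, and names are illustrative assumptions,
 * not taken from the paper. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NUM_NODES 64  /* assumed cluster size; the paper scales from 2 to 64 */

/* FNV-1a hash over the key bytes (an assumed, commonly used choice). */
static uint32_t hash_key(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Identify the node on which a tuple with this key should be processed.
 * In the offloaded schemes, this computation would run on the network
 * processor instead of the host processor. */
static int split_tuple(const void *key, size_t key_len)
{
    return (int)(hash_key(key, key_len) % NUM_NODES);
}

int main(void)
{
    const char *order_key = "ORDERKEY-42";  /* hypothetical TPC-H-style key */
    printf("tuple routed to node %d\n", split_tuple(order_key, strlen(order_key)));
    return 0;
}
```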