Reducing coherence overhead and boosting performance of high-end SMP multiprocessors running a DSS workload

Authors:
Pierfrancesco Foglia;Roberto Giorgi;Cosimo Antonio Prete
Affiliations:
Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Via Diotisalvi 2, 56126 Pisa, Italy;Dipartimento di Ingegneria dell'Informazione, Università di Siena, Via Roma 56, 53100 Siena, Italy;Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Via Diotisalvi 2, 56126 Pisa, Italy
Venue:
Journal of Parallel and Distributed Computing
Year:
2005

Abstract

In this work, we characterized the memory performance of a shared-bus shared-memory multiprocessor running a DSS workload, and in particular the impact of coherence overhead and process migration. When the number of processors is increased to achieve higher computational power, the bus becomes a major bottleneck of such an architecture. We evaluated solutions that can greatly reduce that bottleneck. Database systems are one area where this kind of optimization is important. For this reason, we considered a DSS workload and set up the experiments following the TPC-D specifications on the PostgreSQL DBMS, in order to explore different optimizations on the same kind of workload as evaluated in the literature. In this scenario, we compare possible solutions to boost performance and we show the impact of process migration on coherence overhead. We found that coherence overhead and process migration have a very significant effect on performance in machines with 16 or more processors. In this case, even the limited sharing found in DSS applications can become crucial for system performance. Another important result of our analysis concerns the interaction between the coherence protocol and the scheduler. Basic cache-affinity scheduling is useful in reducing migration, but it is not effective under every load condition. Specific coherence protocols can help reduce the effects of process migration, especially in situations where the scheduler cannot satisfy the affinity requirement. Under these conditions, the use of a write-update protocol with a selective invalidation strategy for private data improves performance (and scalability) by about 20% with respect to a classical MESI-based solution. This advantage is about 50% in the case of high cache-to-cache transfer.
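
The remark that cache-affinity scheduling does not help under every load condition can be illustrated with a toy model. The sketch below is not the scheduler or simulator used in the paper. It assumes a hypothetical policy in which a process is placed on the processor it last ran on if that processor is idle, and otherwise on the lowest-numbered idle processor. The names `pick_cpu` and `count_migrations` and the round-based input format are invented for illustration only.

```python
def pick_cpu(last_cpu, idle_cpus):
    """Affinity rule (assumed): reuse the last CPU if it is idle,
    otherwise fall back to the lowest-numbered idle CPU (or None)."""
    if last_cpu in idle_cpus:
        return last_cpu
    return min(idle_cpus) if idle_cpus else None


def count_migrations(rounds, n_cpus):
    """Count how often a process resumes on a different CPU.

    rounds: list of scheduling rounds; each round is a list of process
            ids in dispatch order (hypothetical input format).
    """
    last_cpu = {}      # process id -> CPU it last ran on
    migrations = 0
    for ready in rounds:
        idle = set(range(n_cpus))   # every CPU is free at the start of a round
        for pid in ready:
            cpu = pick_cpu(last_cpu.get(pid), idle)
            if cpu is None:
                break               # no CPU left; remaining processes wait
            idle.discard(cpu)
            if pid in last_cpu and last_cpu[pid] != cpu:
                migrations += 1     # affinity could not be honored
            last_cpu[pid] = cpu
    return migrations


# Two processes on two CPUs: affinity is always satisfied.
light = count_migrations([["a", "b"], ["b", "a"], ["a", "b"]], n_cpus=2)

# Three processes on two CPUs: "c" takes the CPU that "a" last used,
# so "a" is forced to migrate.
heavy = count_migrations([["a", "b"], ["c", "a"], ["b", "c"]], n_cpus=2)
```

In this model, each migration leaves the process's private data behind in the old cache, which is the situation where the choice of coherence protocol starts to matter.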