A scalable mark-sweep garbage collector on large-scale shared-memory machines

  • Authors:
  • Toshio Endo;Kenjiro Taura;Akinori Yonezawa

  • Affiliations:
  • The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan;The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan;The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan

  • Venue:
  • SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
  • Year:
  • 1997

Abstract

This work describes the implementation of a mark-sweep garbage collector (GC) for shared-memory machines and reports its performance. It is a simple "parallel" collector in which all processors cooperatively traverse objects in the global shared heap. The collector stops the application program during a collection and assumes a uniform access cost to all locations in the shared heap. The implementation is based on the Boehm-Demers-Weiser conservative GC (Boehm GC). Experiments were run on an Ultra Enterprise 10000 (64 UltraSPARC processors, 250 MHz). We wrote two applications, BH (an N-body problem solver) and CKY (a context-free grammar parser), in a parallel extension to C++. Through the experiments, we observe that load balancing is the key to achieving scalability. A naive collector without load redistribution hardly exhibits any speed-up (at most fourfold on 64 processors). Performance can be improved by dynamic load balancing, which exchanges objects to be scanned between processors, but we still observe that a straightforward implementation severely limits performance. First, large objects become a source of significant load imbalance, because the unit of load redistribution is a single object. Performance is improved by splitting a large object into small pieces before pushing it onto the mark stack. Second, processors spend a significant amount of time idle because termination detection is serialized on a shared counter. This problem appeared abruptly on more than 32 processors. Implementing a non-serializing method for termination detection eliminates the idle time and improves performance. With all of these implementation refinements, we achieved an average speed-up of 28.0 in BH and 28.6 in CKY on 64 processors.
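The idea of splitting a large object into small pieces before pushing it onto the mark stack can be illustrated with a minimal sketch. This is not the authors' code, which is built on the Boehm GC in C. The names `MarkEntry`, `push_object`, and `steal_half`, the chunk size of 128 words, and the "take half of the entries" redistribution policy are all assumptions made for illustration only. The abstract states neither a chunk size nor a redistribution policy.

```python
from dataclasses import dataclass
from typing import List

# Assumed chunk size in words; the abstract does not give a value.
CHUNK_WORDS = 128


@dataclass(frozen=True)
class MarkEntry:
    """A range of heap words still to be scanned: [start, start + length)."""
    start: int
    length: int


def push_object(mark_stack: List[MarkEntry], start: int, length: int,
                chunk_words: int = CHUNK_WORDS) -> None:
    """Push an object as entries of at most chunk_words words each.

    Because the unit of load redistribution is one stack entry, a large
    object pushed whole could only ever be scanned by one processor.
    """
    offset = 0
    while offset < length:
        piece = min(chunk_words, length - offset)
        mark_stack.append(MarkEntry(start + offset, piece))
        offset += piece


def steal_half(victim: List[MarkEntry]) -> List[MarkEntry]:
    """Hypothetical redistribution step: move half of the victim's entries
    (rounded down, taken from the bottom of the stack) to the caller."""
    n = len(victim) // 2
    stolen = victim[:n]
    del victim[:n]
    return stolen
```

With a 300-word object and 128-word chunks, the stack receives three entries instead of one, so an idle processor has something to take even when the busy processor holds only that single object.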