The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)

Authors:
Dan Tsafrir
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ecs'07: Proceedings of the 2007 workshop on Experimental computer science
Year:
2007

Abstract

The overhead of a context switch is typically associated with multitasking, where several applications share a processor. But even if only one runnable application is present in the system and supposedly runs alone, it is still repeatedly preempted in favor of a different thread of execution, namely, the operating system that services periodic clock interrupts. We employ two complementary methodologies to measure the overhead incurred by such events and obtain contradictory results. The first methodology systematically changes the interrupt frequency and measures by how much this prolongs the duration of a program that sorts an array. The overall overhead is found to be 0.5-1.5% at 1000 Hz, linearly proportional to the tick rate, and steadily declining as processor speed increases. If the kernel is configured such that each tick is slowed down by an access to an external time source, then the direct overhead dominates. Otherwise, the relative weight of the indirect portion grows steadily with processor speed, accounting for up to 85% of the total. The second methodology repeatedly executes a simplistic loop (calibrated to take 1 ms), measures the actual execution time, and analyzes the perturbations. Some loop implementations yield results similar to the above, but others indicate that the overhead is actually an order of magnitude larger, or worse. The phenomenon was observed on IA32, IA64, and Power processors, the latter being part of the ASC Purple supercomputer. Indeed, the effect is dramatically amplified for parallel jobs, where one late thread holds up all its peers, causing a slowdown that is dominated by the per-node latency (numerator) and the job granularity (denominator). We show the effect is due to an unexplained interrupt/loop interaction; the question of whether this hardware misfeature is experienced by real applications remains open.
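
The amplification for parallel jobs can be illustrated with a toy model. This is a minimal sketch under assumptions that are not taken from the paper: the job is bulk-synchronous, every compute phase lasts `granularity_ms`, each node is independently delayed by `latency_ms` with probability `p_delay` per phase, and a phase ends only when the slowest node finishes. The function name and all parameters are hypothetical.

```python
def expected_slowdown(nodes, p_delay, latency_ms, granularity_ms):
    """Expected relative slowdown of one compute phase (toy model).

    Assumptions (illustrative, not from the paper):
      - each of `nodes` nodes is delayed independently with
        probability `p_delay` during a phase;
      - a delayed node finishes `latency_ms` late;
      - all nodes wait for the slowest one before the next phase.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0.0 <= p_delay <= 1.0:
        raise ValueError("p_delay must be a probability")
    if granularity_ms <= 0:
        raise ValueError("granularity_ms must be positive")

    # Probability that at least one node is late, so the whole job waits.
    p_any_late = 1.0 - (1.0 - p_delay) ** nodes

    # Latency in the numerator, granularity in the denominator.
    return 1.0 + p_any_late * latency_ms / granularity_ms


if __name__ == "__main__":
    # Hypothetical numbers: rare 1 ms delays, 10 ms phases.
    for n in (1, 100, 10_000):
        print(n, expected_slowdown(n, 0.001, 1.0, 10.0))
```

In this model the slowdown approaches 1 + latency/granularity as the node count grows, because some node is then almost always late. That is one way to read the abstract's remark that per-node latency acts as the numerator and job granularity as the denominator.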