Themis: an I/O-efficient MapReduce

Authors:
Alexander Rasmussen;Vinh The Lam;Michael Conley;George Porter;Rishi Kapoor;Amin Vahdat
Affiliations:
UC San Diego;UC San Diego;UC San Diego;UC San Diego;UC San Diego;UC San Diego & Google, Inc.
Venue:
Proceedings of the Third ACM Symposium on Cloud Computing
Year:
2012

Citing 28
Cited 3

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
The input/output complexity of sorting and related problems

Communications of the ACM
Parallel database systems: the future of high performance database systems

Communications of the ACM
Eliminating receive livelock in an interrupt-driven kernel

ACM Transactions on Computer Systems (TOCS)
Random sampling techniques for space efficient online computation of order statistics of large datasets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Advances in the dataflow computational model

Parallel Computing - Special Anniversary issue
Parallel sorting on a shared-nothing architecture using probabilistic splitting

PDIS '91 Proceedings of the first international conference on Parallel and distributed information systems
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Fail-Stutter Fault Tolerance

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Subtleties in tolerating correlated failures in wide-area storage systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Failure trends in a large disk drive population

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

ACM Transactions on Storage (TOS)
Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation

Statistics and Computing
Practical System Reliability

Practical System Reliability
CloudBurst

Bioinformatics
Efficiency matters!

ACM SIGOPS Operating Systems Review
Stateful bulk processing for incremental analytics

Proceedings of the 1st ACM symposium on Cloud computing
Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Proceedings of the 1st ACM symposium on Cloud computing
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Availability in globally distributed storage systems

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Large-scale incremental processing using distributed transactions and notifications

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
TritonSort: a balanced large-scale sorting system

Proceedings of the 8th USENIX conference on Networked systems design and implementation
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

QuickSAN: a storage area network for fast, distributed, solid state disks

Proceedings of the 40th Annual International Symposium on Computer Architecture
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
CooMR: cross-task coordination for efficient data management in MapReduce programs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

"Big Data" computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present Themis, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amount possible for data sets that cannot fit in memory. In order to minimize I/O, Themis makes fundamentally different design decisions from previous MapReduce implementations. Themis performs a wide variety of MapReduce jobs -- including click log analysis, DNA read sequence alignment, and PageRank -- at nearly the speed of TritonSort's record-setting sort performance [29].