Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
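To make the idea of acting on progress reports concrete, the following is a minimal sketch of one way a scheduler might flag tasks worth restarting. It is not Mantri's actual algorithm: the `TaskReport` type, the constant-rate extrapolation, the use of the median completed duration as the cost of a fresh copy, and the `slack` factor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class TaskReport:
    """Hypothetical progress report for one running task."""
    task_id: str
    elapsed_s: float   # seconds since the task started
    progress: float    # fraction of work done, between 0 and 1


def estimate_remaining(report: TaskReport) -> float:
    """Extrapolate remaining time, assuming a constant progress rate."""
    if report.progress <= 0.0:
        return float("inf")  # no progress yet, so no finite estimate
    return report.elapsed_s * (1.0 - report.progress) / report.progress


def pick_restart_candidates(
    running: List[TaskReport],
    completed_durations_s: List[float],
    slack: float = 2.0,  # assumed margin; a restart must be clearly cheaper
) -> List[str]:
    """Return ids of tasks whose estimated remaining time exceeds
    `slack` times the estimated duration of a fresh copy."""
    if not completed_durations_s:
        return []  # nothing to compare against yet
    t_new = median(completed_durations_s)  # assumed cost of a new copy
    return [r.task_id for r in running if estimate_remaining(r) > slack * t_new]


if __name__ == "__main__":
    running = [
        TaskReport("a", elapsed_s=100.0, progress=0.9),
        TaskReport("b", elapsed_s=100.0, progress=0.1),
    ]
    print(pick_restart_candidates(running, [50.0, 60.0, 70.0]))
```

Requiring the remaining time to exceed a multiple of the fresh-copy estimate is one simple way to account for the resource cost of a restart: a task that is only slightly slow is left alone, so the extra slot is spent only when the expected saving is large.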