Lessons learned from a year's worth of benchmarks of large data clouds

Authors:
Yunhong Gu;Robert L. Grossman
Affiliations:
University of Illinois at Chicago;University of Illinois at Chicago
Venue:
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Year:
2009

Abstract

In this paper, we discuss lessons learned from working with the Hadoop and Sector/Sphere systems. Both are cloud-based systems designed to support data-intensive computing, and both pair a distributed file system with a closely coupled system for processing data in parallel. Hadoop uses MapReduce, while Sphere executes an arbitrary user-defined function over the data managed by Sector. We compare and contrast these systems and discuss some of the design trade-offs necessary in data-intensive computing. In our experimental studies over the past year, Sector/Sphere has consistently performed about 2 to 4 times faster than Hadoop. We discuss some of the reasons that might be responsible for this difference in performance.
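
The contrast between the two processing models can be illustrated with a small single-process sketch. This is not the Hadoop or Sphere API; the names run_mapreduce, run_udf, word_mapper, count_reducer, and longest_line are hypothetical, and the in-memory lists stand in for data that a distributed file system would hold. The first function fixes the computation into a map step, a group-by-key step, and a reduce step; the second simply applies whatever function the user supplies to each data segment.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def run_udf(segments, udf):
    """Toy UDF model: apply an arbitrary user-defined function to each segment."""
    return [udf(segment) for segment in segments]


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    # Sum the counts collected for one word.
    return sum(values)


def longest_line(segment):
    # An arbitrary per-segment computation with no key/value structure.
    return max(segment, key=len)


# Hypothetical data: two segments, each a list of text lines.
segments = [["a b a", "c"], ["b b", "a c c c"]]
lines = [line for segment in segments for line in segment]

word_counts = run_mapreduce(lines, word_mapper, count_reducer)
longest = run_udf(segments, longest_line)
```

In this sketch the word count has to be phrased as a mapper and a reducer, whereas the per-segment function is free to compute anything over its segment. A MapReduce job can also be expressed in the second model by supplying suitable functions, which is one way to read the abstract's description of Sphere as the more general of the two.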