Distributed processing of very large datasets with DataCutter
Parallel Computing - Clusters and computational grids for scientific computing
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Stork: Making Data Placement a First Class Citizen in the Grid
ICDCS '04 Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04)
Distributed computing in practice: the Condor experience: Research Articles
Concurrency and Computation: Practice & Experience - Grid Performance
Distributing the Sloan Digital Sky Survey Using UDT and Sector
E-SCIENCE '06 Proceedings of the Second IEEE International Conference on e-Science and Grid Computing
UDT: UDP-based data transfer for high-speed wide area networks
Computer Networks: The International Journal of Computer and Telecommunications Networking
Explicit control a batch-aware distributed file system
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Toward loosely coupled programming on petascale systems
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
VL2: a scalable and flexible data center network
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
A MapReduce workflow system for architecting scientific data intensive applications
Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing
A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment
UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing
Hi-index | 0.00 |
In this paper, we discuss some of the lessons that we have learned working with the Hadoop and Sector/Sphere systems. Both of these systems are cloud-based systems designed to support data intensive computing. Both include distributed file systems and closely coupled systems for processing data in parallel. Hadoop uses MapReduce, while Sphere supports the ability to execute an arbitrary user defined function over the data managed by Sector. We compare and contrast these systems and discuss some of the design trade-offs necessary in data intensive computing. In our experimental studies over the past year, Sector/Sphere has consistently performed about 2--4 times faster than Hadoop. We discuss some of the reasons that might be responsible for this difference in performance.