Toward Efficient and Simplified Distributed Data Intensive Computing

Authors:
Yunhong Gu; Robert Grossman
Affiliations:
University of Illinois at Chicago, Chicago; University of Illinois at Chicago and the Open Data Group, Chicago
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2011

Abstract

While the capability of computing systems has been increasing at the rate of Moore's Law, the amount of digital data has been increasing even faster. There is a growing need for systems that can manage and analyze very large data sets, preferably on shared-nothing commodity systems because of their low cost. In this paper, we describe the design and implementation of a distributed file system called Sector and an associated programming framework called Sphere that processes the data managed by Sector in parallel. Sphere is designed so that data is processed in place whenever possible, a property sometimes called data locality. We describe the directives Sphere supports to improve data locality. In our experimental studies, the Sector/Sphere system has consistently performed about 2-4 times faster than Hadoop, the most popular system for processing very large data sets.
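
The idea of processing data in place can be illustrated with a minimal sketch of locality-aware task assignment. This is not Sphere's API or its actual scheduler; the function `assign_segments`, the node names, and the segment names are all hypothetical, and the greedy least-loaded policy is an assumption chosen only for illustration. Each data segment is given to a worker that already holds a copy when one exists, and falls back to a remote worker otherwise.

```python
from collections import defaultdict


def assign_segments(segment_locations, workers):
    """Greedy locality-aware assignment (illustrative only, not Sphere's scheduler).

    segment_locations: dict mapping segment name -> set of nodes holding a copy
    workers: list of node names available for processing
    Returns a dict mapping worker -> list of (segment, is_local) pairs.
    """
    load = {w: 0 for w in workers}
    plan = defaultdict(list)
    for segment in sorted(segment_locations):
        replicas = segment_locations[segment]
        # Prefer workers that already store the segment (data locality).
        local = [w for w in workers if w in replicas]
        candidates = local if local else workers
        # Among candidates, pick the least-loaded worker; break ties by name.
        chosen = min(candidates, key=lambda w: (load[w], w))
        plan[chosen].append((segment, chosen in replicas))
        load[chosen] += 1
    return dict(plan)


# Hypothetical cluster layout: node-d stores seg3 but is not a worker.
segments = {
    "seg0": {"node-a"},
    "seg1": {"node-a", "node-b"},
    "seg2": {"node-c"},
    "seg3": {"node-d"},
}
workers = ["node-a", "node-b", "node-c"]
plan = assign_segments(segments, workers)

for worker in sorted(plan):
    print(worker, plan[worker])
```

In this sketch only `seg3` has to be read over the network, because no worker holds a copy of it; every other segment is processed on a node that already stores it.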