All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids

Authors:
Christopher Moretti;Hoang Bui;Karen Hollingsworth;Brandon Rich;Patrick Flynn;Douglas Thain
Affiliations:
University of Notre Dame, Notre Dame;University of Notre Dame, Notre Dame;University of Notre Dame, Notre Dame;University of Notre Dame, Notre Dame;University of Notre Dame, Notre Dame;University of Notre Dame, Notre Dame
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2010

Citing 0
Cited 12

Cloud technologies for bioinformatics applications
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers

Twister: a runtime for iterative MapReduce
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

I/O streaming evaluation of batch queries for data-intensive computational turbulence
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

An approach for processing large and non-uniform media objects on mapreduce-based clusters
ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation

Evaluating the suitability of mapreduce for surface temperature analysis codes
Proceedings of the second international workshop on Data intensive computing in the clouds

Design patterns for scientific applications in DryadLINQ CTP
Proceedings of the second international workshop on Data intensive computing in the clouds

Provenance for MapReduce-based data-intensive workflows
Proceedings of the 6th workshop on Workflows in support of large-scale science

Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper
Proceedings of the 2012 Joint EDBT/ICDT Workshops

HyMR: a hybrid MapReduce workflow system
Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences

Don't match twice: redundancy-free similarity computation with MapReduce
Proceedings of the Second Workshop on Data Analytics in the Cloud

Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions
Journal of Grid Computing

Approaches to Distributed Execution of Scientific Workflows in Kepler
Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology

Abstract

Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for nonexpert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame and has been used to carry out a groundbreaking analysis of biometric data.
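
The abstract does not spell out what the abstraction computes. As a minimal sketch, assuming the usual reading in which All-Pairs(A, B, F) yields a matrix M with M[i][j] = F(A[i], B[j]), the semantics might be illustrated as follows. The names all_pairs and hamming and the toy bit-string inputs are hypothetical and chosen only for illustration; the sequential loop shows the meaning of the result, not the optimized grid execution the abstract describes.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def all_pairs(
    set_a: Sequence[A],
    set_b: Sequence[B],
    compare: Callable[[A, B], R],
) -> List[List[R]]:
    """Return M such that M[i][j] = compare(set_a[i], set_b[j]).

    Illustrative, single-machine version: every element of set_a is
    compared against every element of set_b, one pair at a time.
    """
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(x: str, y: str) -> int:
    """Toy comparison function (an assumption, not from the source):
    count positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


if __name__ == "__main__":
    templates = ["0011", "0101"]
    probes = ["0011", "1111"]
    for row in all_pairs(templates, probes, hamming):
        print(row)
```

The number of comparisons grows as the product of the two set sizes, which is why a workload that looks trivial to express can become data-intensive at scale.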