DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems

Authors:
Yin Huai;Rubao Lee;Simon Zhang;Cathy H. Xia;Xiaodong Zhang
Affiliations:
The Ohio State University;The Ohio State University;Cornell University;The Ohio State University;The Ohio State University
Venue:
Proceedings of the 2nd ACM Symposium on Cloud Computing
Year:
2011



Abstract

Traditional parallel processing models, such as BSP, are "scale up" based: they aim to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems. Big data analytics tasks, in contrast, aim for high throughput and demand that large distributed systems "scale out" by continuously adding computing and storage resources through networks. The "scale up" and "scale out" models each have a different set of performance requirements and system bottlenecks. In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model, called DOT, is represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively. With the DOT model, the execution of any big data analytics job in various software frameworks can be represented by a specific or non-specific number of elementary or composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes the necessary data transfers, and finally performs data transformations. The DOT model achieves scalability and fault tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines that are independent of framework and implementation and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
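As a rough illustration of how one DOT block could be evaluated, the sketch below treats D as a list of n data chunks, O as an n-by-m grid of operators, and T as a list of m transformations. The abstract does not specify these shapes, so the layout, the function name `dot_block`, and the sum/max example are all assumptions made for illustration, not the paper's definition. Each operator reads only its own chunk, which mirrors the data-dependency-free relationship among concurrent tasks.

```python
from typing import Any, Callable, List, Sequence

Operator = Callable[[Any], Any]
Transform = Callable[[List[Any]], Any]


def dot_block(
    data_chunks: Sequence[Any],
    operators: Sequence[Sequence[Operator]],
    transforms: Sequence[Transform],
) -> List[Any]:
    """Evaluate one hypothetical DOT-style block.

    data_chunks: n data chunks (the D part).
    operators:   n rows of m operators (the O part); operators[i][j]
                 is applied to chunk i only, so tasks share no data.
    transforms:  m functions (the T part); transforms[j] combines the
                 n intermediate results produced by operator column j.
    """
    n, m = len(data_chunks), len(transforms)
    if len(operators) != n or any(len(row) != m for row in operators):
        raise ValueError("operators must be an n-by-m grid")

    outputs = []
    for j in range(m):
        # Operation phase: each chunk is processed independently.
        intermediates = [operators[i][j](data_chunks[i]) for i in range(n)]
        # Transformation phase: gather column j and combine it.
        outputs.append(transforms[j](intermediates))
    return outputs


# Example: global sum and global max over three chunks.
chunks = [[1, 2, 3], [4, 5], [6]]
ops = [[sum, max] for _ in chunks]   # same two operators on every chunk
result = dot_block(chunks, ops, [sum, max])
print(result)
```

In this sketch the list comprehension stands in for the concurrent operation phase, and gathering a column before calling the transformation stands in for the data transfer step; a real framework would run the operators on separate workers.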