Optimizing analytic data flows for multiple execution engines

Authors:
Alkis Simitsis; Kevin Wilkinson; Malu Castellanos; Umeshwar Dayal
Affiliations:
HP Labs, Palo Alto, CA, USA; HP Labs, Palo Alto, CA, USA; HP Labs, Palo Alto, CA, USA; HP Labs, Palo Alto, CA, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Abstract

Next-generation business intelligence involves data flows that span different execution engines, contain complex functionality such as data and text analytics and machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is challenging, labor-intensive, and time-consuming. Optimizing these flows is currently an ad hoc process whose result depends largely on the abilities and experience of the flow designer. Our previous work addressed analytic flow optimization for multiple objectives over a single execution engine. This paper focuses on optimizing flows for a single objective, namely performance, over multiple execution engines. We consider flows that span a DBMS, a Map-Reduce engine, and an orchestration engine (e.g., an ETL tool or scripting language). This configuration is emerging as a common paradigm for combining analysis of unstructured data with analysis of structured data (e.g., NoSQL plus SQL). We present flow transformations that model data shipping, function shipping, and operation decomposition, and we describe how flow graphs are generated for multiple engines. Performance results for various configurations demonstrate the benefit of optimization.
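
To make the ideas of function shipping and data shipping concrete, the following is a minimal illustrative sketch, not the algorithm or cost model from the paper. It assumes a linear flow, a hypothetical `Op` record with made-up per-engine costs, a hypothetical flat `SHIP_COST` charged whenever adjacent operations run on different engines, and an exhaustive search standing in for a real optimizer. The engine labels `"mr"` (Map-Reduce) and `"db"` (DBMS) are placeholders, and operation decomposition is not modeled.

```python
from dataclasses import dataclass, replace
from itertools import product

# Hypothetical flat cost of moving data between two different engines.
SHIP_COST = 5


@dataclass
class Op:
    name: str
    engine: str   # engine the operation is currently assigned to
    cost: dict    # assumed execution cost per engine, e.g. {"mr": 10, "db": 40}


def flow_cost(flow):
    """Execution cost plus data-shipping cost for a linear flow."""
    run = sum(op.cost[op.engine] for op in flow)
    ship = sum(SHIP_COST for a, b in zip(flow, flow[1:]) if a.engine != b.engine)
    return run + ship


def function_ship(flow, index, engine):
    """Return a new flow in which the operation at `index` runs on `engine`."""
    new_flow = list(flow)
    new_flow[index] = replace(flow[index], engine=engine)
    return new_flow


def best_assignment(flow, engines):
    """Try every engine assignment and keep the cheapest (toy search only)."""
    best = None
    for choice in product(engines, repeat=len(flow)):
        # Skip assignments that use an engine an operation cannot run on.
        if any(e not in op.cost for op, e in zip(flow, choice)):
            continue
        candidate = [replace(op, engine=e) for op, e in zip(flow, choice)]
        if best is None or flow_cost(candidate) < flow_cost(best):
            best = candidate
    return best


# Illustrative flow with invented costs: text processing is cheap on
# Map-Reduce, relational operations are cheap on the DBMS.
flow = [
    Op("extract_text", "mr", {"mr": 10, "db": 40}),
    Op("sentiment", "mr", {"mr": 20, "db": 35}),
    Op("join_sales", "mr", {"mr": 30, "db": 8}),
    Op("aggregate", "mr", {"mr": 12, "db": 4}),
]

optimized = best_assignment(flow, ["mr", "db"])
print(flow_cost(flow), flow_cost(optimized), [op.engine for op in optimized])
```

In this toy setting, shipping the two relational operations to the DBMS pays for the single data transfer it introduces, which is the kind of trade-off a multi-engine optimizer has to weigh. Exhaustive enumeration grows exponentially with flow length, so it is only suitable for illustration.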