ACM Transactions on Database Systems (TODS)
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Query Optimization in a Heterogeneous DBMS
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Processing Queries Over Generalization Hierarchies in a Multidatabase System
VLDB '83 Proceedings of the 9th International Conference on Very Large Data Bases
Optimizing ETL Processes in Data Warehouses
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
PQR: Predicting Query Execution Times for Autonomous Workload Management
ICAC '08 Proceedings of the 2008 International Conference on Autonomic Computing
Parallelizing query optimization
Proceedings of the VLDB Endowment
QoX-driven ETL design: reducing the cost of ETL consulting engagements
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing
The performance of MapReduce: an in-depth study
Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
CIEL: a universal execution engine for distributed data-flow computing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Schedule optimization for data processing flows on the cloud
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
ARIA: automatic resource inference and allocation for mapreduce environments
Proceedings of the 8th ACM international conference on Autonomic computing
Resource provisioning framework for mapreduce jobs with performance goals
Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
Optimizing flows for real time operations management
SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics
Proceedings of the 16th International Conference on Extending Database Technology
xPAD: a platform for analytic data flows
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The farm: where pig scripts are bred and raised
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Multi-objective optimization of data flows in a multi-cloud environment
Proceedings of the Second Workshop on Data Analytics in the Cloud
Towards a workload for evolutionary analytics
Proceedings of the Second Workshop on Data Analytics in the Cloud
Hybrid Analytic Flows-the Case for Optimization
Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
Hi-index | 0.00 |
Next generation business intelligence involves data flows that span different execution engines, contain complex functionality like data/text analytics, machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is a challenging task and is both labor-intensive and time-consuming. Optimizing these flows is currently an ad-hoc process where the result is largely dependent on the abilities and experience of the flow designer. Our previous work addressed analytic flow optimization for multiple objectives over a single execution engine. This paper focuses on optimizing flows for a single objective, namely performance, over multiple execution engines. We consider flows that span a DBMS, a Map-Reduce engine, and an orchestration engine (e.g., an ETL tool or scripting language). This configuration is emerging as a common paradigm used to combine analysis of unstructured data with analysis of structured data (e.g., NoSQL plus SQL). We present flow transformations that model data shipping, function shipping, and operation decomposition and we describe how flow graphs are generated for multiple engines. Performance results for various configurations demonstrate the benefit of optimization.