Optimization of analytic data flows for next generation business intelligence applications

Authors:
Umeshwar Dayal;Kevin Wilkinson;Alkis Simitsis;Malu Castellanos;Lupita Paz
Affiliations:
HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA
Venue:
TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
Year:
2011

Citing 21
Cited 1

Global query optimization

SIGMOD '86 Proceedings of the 1986 ACM SIGMOD international conference on Management of data
Multiple-query optimization

ACM Transactions on Database Systems (TODS)
The Garlic project

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Optimizing Queries Across Diverse Data Sources

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Query Optimization in a Heterogeneous DBMS

VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Processing Queries Over Generalization Hierarchies in a Multidatabase System

VLDB '83 Proceedings of the 9th International Conference on Very Large Data Bases
Optimizing ETL Processes in Data Warehouses

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
An approach to optimize data processing in business processes

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Parallelizing query optimization

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Data integration flows for business intelligence

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
QoX-driven ETL design: reducing the cost of ETL consulting engagements

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Benchmarking ETL Workflows

Performance Evaluation and Benchmarking
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Runtime measurements in the cloud: observing, analyzing, and reducing variance

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
Leveraging business process models for ETL design

ER'10 Proceedings of the 29th international conference on Conceptual modeling
CIEL: a universal execution engine for distributed data-flow computing

Proceedings of the 8th USENIX conference on Networked systems design and implementation

Analytics-as-a-service: confluence of big data, cloud computing and software-as-a-service

CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper addresses the challenge of optimizing analytic data flows for modern business intelligence (BI) applications. We first describe the changing nature of BI in today's enterprises as it has evolved from batch-based processes, in which the back-end extraction-transform-load (ETL) stage was separate from the front-end query and analytics stages, to near real-time data flows that fuse the back-end and front-end stages. We describe industry trends that force new BI architectures, e.g., mobile and cloud computing, semi-structured content, event and content streams as well as different execution engine architectures. For execution engines, the consequence of "one size does not fit all" is that BI queries and analytic applications now require complicated information flows as data is moved among data engines and queries span systems. In addition, new quality of service objectives are desired that incorporate measures beyond performance such as freshness (latency), reliability, accuracy, and so on. Existing approaches that optimize data flows simply for performance on a single system or a homogeneous cluster are insufficient. This paper describes our research to address the challenge of optimizing this new type of flow. We leverage concepts from earlier work in federated databases, but we face a much larger search space due to new objectives and a larger set of operators. We describe our initial optimizer that supports multiple objectives over a single processing engine. We then describe our research in optimizing flows for multiple engines and objectives and the challenges that remain.