Easy and effective parallel programmable ETL

Authors:
Christian Thomsen; Torben Bach Pedersen
Affiliations:
Aalborg University, Aalborg, Denmark; Aalborg University, Aalborg, Denmark
Venue:
Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Year:
2011

Abstract

Extract-Transform-Load (ETL) programs are used to load data into data warehouses (DWs). An ETL program must extract data from sources, apply different transformations to it, and use the DW to look up and insert data. Both developing and running an ETL program are time consuming. Typically, however, an ETL program can exploit both task parallelism and data parallelism to run faster. This, in turn, lengthens development time, as creating a parallel ETL program is complex. To remedy this situation, we propose efficient ways to parallelize typical ETL tasks and implement these new constructs in an ETL framework. The constructs are easy to apply and require only a few modifications to an ETL program to parallelize it. They support both task and data parallelism and give the programmer different options to choose from. An experimental evaluation shows that, by using a little more CPU time, the (wall-clock) time to run an ETL program can be greatly reduced.
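
To make the distinction between task parallelism and data parallelism concrete, the following is a minimal sketch that uses only the Python standard library. It is not the framework or the constructs proposed in the paper. The functions `extract`, `transform`, `load`, and `run_etl`, the row layout, and the in-memory `warehouse` dictionary are all hypothetical stand-ins chosen for illustration. Threads are used here for simplicity; for CPU-bound transformations in CPython, a process-based pool would likely be needed to gain real speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None  # marks the end of the row stream


def extract():
    """Hypothetical source: yields raw rows as dicts."""
    for i in range(8):
        yield {"id": i, "price": str(i * 10)}


def transform(row):
    """Hypothetical transformation: cast the price to int and add 25% tax."""
    price = int(row["price"])
    return {"id": row["id"], "price": price, "price_with_tax": price * 125 // 100}


def load(rows_in, warehouse):
    """Loader task: consumes transformed rows until the sentinel arrives."""
    while True:
        row = rows_in.get()
        if row is SENTINEL:
            break
        warehouse[row["id"]] = row  # stand-in for an INSERT into the DW


def run_etl(workers=4):
    warehouse = {}  # assumption: a dict stands in for the data warehouse
    rows_out = queue.Queue(maxsize=100)

    # Task parallelism: the loader runs concurrently with extract/transform.
    loader = threading.Thread(target=load, args=(rows_out, warehouse))
    loader.start()

    # Data parallelism: several workers apply the same transformation
    # to different rows.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for row in pool.map(transform, extract()):
            rows_out.put(row)

    rows_out.put(SENTINEL)  # tell the loader that no more rows will come
    loader.join()
    return warehouse
```

Calling `run_etl()` returns the populated dictionary. The sequential version of the same program would simply call `load` on each result of `transform` in a single loop, so the parallel version differs only in the queue, the loader thread, and the worker pool.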