Extract-Transform-Load (ETL) programs load data into data warehouses (DWs). An ETL program must extract data from sources, apply transformations to it, and use the DW to look up and insert data. Both developing and running an ETL program are time consuming. Typically, an ETL program can exploit both task parallelism and data parallelism to run faster, but this lengthens development because parallel ETL programs are complex to write. To remedy this situation, we propose efficient ways to parallelize typical ETL tasks and implement these new constructs in an ETL framework. The constructs are easy to apply and require only a few modifications to parallelize an ETL program. They support both task and data parallelism and give the programmer several options to choose from. An experimental evaluation shows that, at the cost of a little more CPU time, the wall-clock time to run an ETL program can be greatly reduced.
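
To make the distinction between task and data parallelism concrete, the following is a minimal sketch using only the Python standard library. It is not the framework's API: the function names (`extract`, `transform_row`, `transform`, `load`, `run_etl`), the sample transformation, and the batch size are all assumptions chosen for illustration. The three stages run concurrently as separate threads connected by queues (task parallelism), and within the transform stage each batch of rows is spread over a worker pool (data parallelism).

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = object()  # marks the end of the row stream between stages


def extract(source_rows, out_q):
    """Stage 1: read rows from the source and pass them downstream."""
    for row in source_rows:
        out_q.put(row)
    out_q.put(SENTINEL)


def transform_row(row):
    """Hypothetical transformation: normalize a name, convert a price to cents."""
    return {"name": row["name"].strip().lower(), "cents": round(row["price"] * 100)}


def transform(in_q, out_q, workers=4, batch_size=2):
    """Stage 2: transform rows batch by batch, each batch in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batch, done = [], False
        while not done:
            row = in_q.get()
            if row is SENTINEL:
                done = True
            else:
                batch.append(row)
            if batch and (done or len(batch) >= batch_size):
                # pool.map returns results in input order
                for result in pool.map(transform_row, batch):
                    out_q.put(result)
                batch = []
    out_q.put(SENTINEL)


def load(in_q, warehouse):
    """Stage 3: insert rows into the target (a plain list stands in for the DW)."""
    while (row := in_q.get()) is not SENTINEL:
        warehouse.append(row)


def run_etl(source_rows):
    """Run the three stages concurrently, connected by bounded queues."""
    q1, q2 = queue.Queue(maxsize=100), queue.Queue(maxsize=100)
    warehouse = []
    stages = [
        threading.Thread(target=extract, args=(source_rows, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, warehouse)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return warehouse
```

Calling `run_etl` with a list of dictionaries that have `name` and `price` keys returns the transformed rows in their original order. Threads keep the sketch short, but in CPython they do not speed up CPU-bound transformations; a process pool would be the usual substitute there, which is also where the extra CPU time mentioned in the abstract would plausibly come from.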