ETLMR: a highly scalable dimensional ETL framework based on MapReduce

Authors:
Xiufeng Liu;Christian Thomsen;Torben Bach Pedersen
Affiliations:
Dept. of Computer Science, Aalborg University;Dept. of Computer Science, Aalborg University;Dept. of Computer Science, Aalborg University
Venue:
DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Year:
2011

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few lines of code. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
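
To make the idea of MapReduce-based dimension processing more concrete, the following is a minimal sketch in plain Python of how a slowly changing dimension (type 2) could be loaded in a map/shuffle/reduce style. It is not ETLMR's API and not taken from the paper: every function name, the attribute names (`url`, `size`, `lastmoddate`), the sample data, and the sequential surrogate-key counter are assumptions made for illustration only. The sketch runs in a single process; a real MapReduce runtime would distribute the map and reduce calls.

```python
import itertools
from collections import defaultdict

# Hypothetical source rows; attribute names are assumptions for this sketch.
SAMPLE_ROWS = [
    {"url": "a.com/x", "size": 10, "lastmoddate": "2011-01-01"},
    {"url": "b.com/y", "size": 5, "lastmoddate": "2011-01-02"},
    {"url": "a.com/x", "size": 12, "lastmoddate": "2011-02-01"},
    {"url": "a.com/x", "size": 12, "lastmoddate": "2011-03-01"},
]


def map_rows(rows, key):
    """Map phase: emit (business key, row) pairs."""
    for row in rows:
        yield row[key], row


def shuffle(pairs):
    """Group mapped pairs by key, as a MapReduce runtime would do."""
    groups = defaultdict(list)
    for k, row in pairs:
        groups[k].append(row)
    return groups


def reduce_scd2(rows, date_attr, next_id):
    """Reduce phase: turn all source versions of one member into type-2 rows."""
    out = []
    for row in sorted(rows, key=lambda r: r[date_attr]):
        # Skip a source row whose attributes (other than the date) are unchanged.
        if out and all(out[-1][a] == row[a] for a in row if a != date_attr):
            continue
        if out:
            out[-1]["validto"] = row[date_attr]  # close the previous version
        out.append({**row, "id": next_id(), "version": len(out) + 1,
                    "validfrom": row[date_attr], "validto": None})
    return out


def load_dimension(rows, key="url", date_attr="lastmoddate"):
    """Run map, shuffle, and reduce sequentially and return the dimension rows."""
    ids = itertools.count(1)  # single-process surrogate keys (an assumption)
    groups = shuffle(map_rows(rows, key))
    dim = []
    for k in sorted(groups):
        dim.extend(reduce_scd2(groups[k], date_attr, lambda: next(ids)))
    return dim
```

Calling `load_dimension(SAMPLE_ROWS)` produces one dimension row per distinct version of each member, with `validfrom`/`validto` marking the period in which each version applies. Grouping by business key means all versions of a member reach the same reduce call, which is what allows versions to be ordered and closed locally. Assigning surrogate keys across parallel reducers is a separate concern that this sketch sidesteps with one counter.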