Trojan data layouts: right shoes for a running elephant

Authors:
Alekh Jindal;Jorge-Arnulfo Quiané-Ruiz;Jens Dittrich
Affiliations:
Saarland University;Saarland University;Saarland University
Venue:
Proceedings of the 2nd ACM Symposium on Cloud Computing
Year:
2011

Abstract

MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce can be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside the data blocks of the Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to the data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
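
The per-replica idea can be illustrated with a small sketch. It is not the paper's algorithm or cost model: the function names, the attribute widths, and the fixed per-group penalty (standing in for seek and tuple-reconstruction overhead) are all assumptions made for illustration. Each replica stores the same rows under a different grouping of attributes, and a job is routed to the replica whose grouping makes it read the least.

```python
from typing import Dict, FrozenSet, List, Set

# A layout is a list of disjoint attribute groups stored together in a block.
Layout = List[FrozenSet[str]]


def read_cost(layout: Layout, needed: Set[str], widths: Dict[str, int],
              group_penalty: int = 2) -> int:
    """Hypothetical per-row cost of answering a job from one replica.

    Assumption: any group holding at least one needed attribute is read in
    full, and each touched group adds a fixed penalty that stands in for
    seek and tuple-reconstruction overhead.
    """
    touched = [group for group in layout if group & needed]
    data_bytes = sum(widths[attr] for group in touched for attr in group)
    return data_bytes + group_penalty * len(touched)


def pick_replica(replicas: Dict[str, Layout], needed: Set[str],
                 widths: Dict[str, int]) -> str:
    """Return the node whose replica layout is cheapest for this job."""
    return min(replicas, key=lambda node: read_cost(replicas[node], needed, widths))


# Hypothetical schema: attribute name -> width in bytes.
widths = {"id": 4, "name": 32, "price": 8, "comment": 100}

# Three replicas of the same block, each with a different layout.
replicas = {
    "node1": [frozenset(widths)],                    # row-like: one group
    "node2": [frozenset({a}) for a in widths],       # column-like: one group per attribute
    "node3": [frozenset({"id", "price"}),            # grouped for jobs that read id and price
              frozenset({"name", "comment"})],
}

print(pick_replica(replicas, {"id", "price"}, widths))
print(pick_replica(replicas, {"price"}, widths))
print(pick_replica(replicas, set(widths), widths))
```

Under these assumed numbers, a job reading `id` and `price` is routed to the grouped replica, a job reading only `price` to the column-like replica, and a job reading every attribute to the row-like replica. This is the sense in which different layouts per replica can serve a mixed workload.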