Vertical partitioning algorithms for database design
ACM Transactions on Database Systems (TODS)
Database partitioning in a cluster of processors
ACM Transactions on Database Systems (TODS)
SIGMOD '85 Proceedings of the 1985 ACM SIGMOD international conference on Management of data
Weaving Relations for Cache Performance
Proceedings of the 27th International Conference on Very Large Data Bases
Integrating vertical and horizontal partitioning into automated physical database design
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
C-store: a column-oriented DBMS
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Data morphing: an adaptive, cache-conscious storage technique
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Self-tuning database systems: a decade of progress
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Column-stores vs. row-stores: how different are they really?
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Introduction to Information Retrieval
Introduction to Information Retrieval
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce: a flexible data processing tool
Communications of the ACM - Amir Pnueli: Ahead of His Time
Column-oriented database systems
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
HadoopToSQL: a mapReduce query optimizer
Proceedings of the 5th European conference on Computer systems
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
Proceedings of the 5th European conference on Computer systems
ParaTimer: a progress indicator for MapReduce DAGs
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Data warehousing and analytics infrastructure at facebook
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Manimal: relational optimization for data-intensive programs
Procceedings of the 13th International Workshop on the Web and Databases
Energy management for MapReduce clusters
Proceedings of the VLDB Endowment
Runtime measurements in the cloud: observing, analyzing, and reducing variance
Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce
Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
Cheetah: a high performance, custom data warehouse on top of MapReduce
Proceedings of the VLDB Endowment
HYRISE: a main memory hybrid storage engine
Proceedings of the VLDB Endowment
Column-oriented storage techniques for MapReduce
Proceedings of the VLDB Endowment
ES2: A cloud data storage system for supporting both OLTP and OLAP
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Only aggressive elephants are fast elephants
Proceedings of the VLDB Endowment
The unified logging infrastructure for data analytics at Twitter
Proceedings of the VLDB Endowment
Efficient big data processing in Hadoop MapReduce
Proceedings of the VLDB Endowment
CARTILAGE: adding flexibility to the Hadoop skeleton
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Modeling I/O interference for data intensive distributed applications
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Scaling big data mining infrastructure: the twitter experience
ACM SIGKDD Explorations Newsletter
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Mosquito: another one bites the data upload stream
Proceedings of the VLDB Endowment
A comparison of knives for bread slicing
Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.