Column-oriented storage techniques for MapReduce

  • Authors:
  • Avrilia Floratou; Jignesh M. Patel; Eugene J. Shekita; Sandeep Tata

  • Affiliations:
  • University of Wisconsin--Madison; University of Wisconsin--Madison; IBM Almaden Research Center; IBM Almaden Research Center

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2011

Abstract

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated into Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and a lazy record construction strategy that avoid deserializing unwanted record fields, providing an additional 1.5x performance boost. Experiments on a real intranet crawl show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
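
To make the lazy record construction idea concrete, the following minimal Java sketch shows one way a record reader could defer per-column deserialization until a field is actually accessed. The class and field names (LazyRecord, "url", "content") are illustrative assumptions for this sketch only, not the storage format or API described in the paper.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch: a record whose columns are kept as raw bytes
 * and deserialized only on first access. If the map function filters
 * on one column and discards the record, the other columns are never
 * deserialized. (Hypothetical example; not the paper's implementation.)
 */
final class LazyRecord {
    private final Map<String, byte[]> rawColumns;                 // serialized column values
    private final Map<String, String> decoded = new HashMap<>();  // cache of deserialized values

    LazyRecord(Map<String, byte[]> rawColumns) {
        this.rawColumns = rawColumns;
    }

    /** Deserialize a column lazily and cache the result. */
    String get(String column) {
        return decoded.computeIfAbsent(column,
                c -> new String(rawColumns.get(c), StandardCharsets.UTF_8));
    }
}

public class LazyRecordDemo {
    public static void main(String[] args) {
        Map<String, byte[]> raw = new HashMap<>();
        raw.put("url", "http://example.com".getBytes(StandardCharsets.UTF_8));
        raw.put("content", "<html>...</html>".getBytes(StandardCharsets.UTF_8));

        LazyRecord rec = new LazyRecord(raw);

        // A selective map function only touches "url"; the (typically much
        // larger) "content" column stays as raw bytes if the record is dropped.
        if (rec.get("url").contains("example")) {
            System.out.println("matched: " + rec.get("url"));
        }
    }
}

The additional 1.5x improvement reported in the abstract comes from avoiding exactly this kind of wasted deserialization work on fields the map function never uses.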