Understanding insights into the basic structure and essential issues of table placement methods in clusters

Authors:
Yin Huai;Siyuan Ma;Rubao Lee;Owen O'Malley;Xiaodong Zhang
Affiliations:
Department of Computer Science and Engineering, The Ohio State University;Department of Computer Science and Engineering, The Ohio State University;Department of Computer Science and Engineering, The Ohio State University;Hortonworks Inc.;Department of Computer Science and Engineering, The Ohio State University
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Citing 9
Cited 0

On the design of a new Linux readahead framework

ACM SIGOPS Operating Systems Review - Research and developments in the Linux kernel
Cheetah: a high performance, custom data warehouse on top of MapReduce

Proceedings of the VLDB Endowment
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
YSmart: Yet Another SQL-to-MapReduce Translator

ICDCS '11 Proceedings of the 2011 31st International Conference on Distributed Computing Systems
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Mastiff: A MapReduce-based System for Time-Based Big Data Analytics

CLUSTER '12 Proceedings of the 2012 IEEE International Conference on Cluster Computing
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Quantified Score

Hi-index	0.00

Visualization

Abstract

A table placement method is a critical component in big data analytics on distributed systems. It determines the way how data values in a two-dimensional table are organized and stored in the underlying cluster. Based on Hadoop computing environments, several table placement methods have been proposed and implemented. However, a comprehensive and systematic study to understand, to compare, and to evaluate different table placement methods has not been done. Thus, it is highly desirable to gain important insights into the basic structure and essential issues of table placement methods in the context of big data processing infrastructures. In this paper, we present such a study. The basic structure of a data placement method consists of three core operations: row reordering, table partitioning, and data packing. All the existing placement methods are formed by these core operations with variations made by the three key factors: (1) the size of a horizontal logical subset of a table (or the size of a row group), (2) the function of mapping columns to column groups, and (3) the function of packing columns or column groups in a row group into physical blocks. We have designed and implemented a benchmarking tool to provide insights into how variations of each factor affect the I/O performance of reading data of a table stored by a table placement method. Based on our results, we give suggested actions to optimize table reading performance. Results from large-scale experiments have also confirmed that our findings are valid for production workloads. Finally, we present ORC File as a case study to show the effectiveness of our findings and suggested actions.