Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Authors:
Yuting Lin;Divyakant Agrawal;Chun Chen;Beng Chin Ooi;Sai Wu
Affiliations:
National University of Singapore, Singapore, Singapore;University of California, Santa Barbara, Santa Barbara, USA;Zhejiang University, Zhejiang, China;National University of Singapore, Singapore, Singapore;National University of Singapore, Singapore, Singapore
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, Llama, a hybrid data management system that combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups, which provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on the TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiments are conducted on Amazon EC2. The results show that Llama has excellent load performance and that its query performance is significantly better than that of the traditional MapReduce framework based on row-wise storage.
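
The idea of vertically partitioning a table into column groups can be illustrated with a minimal sketch. This is not Llama's implementation: the function names (`vertical_partition`, `scan`), the group layout, and the sample rows are all hypothetical, and the grouping here is chosen by hand rather than derived from column correlation. It only shows that a query touching a few columns needs to read just the group that holds them.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def vertical_partition(
    rows: Sequence[Row], groups: Dict[str, List[str]]
) -> Dict[str, List[Tuple]]:
    """Split rows into column groups.

    Each group stores records of the form (row_id, value, value, ...),
    so that groups can later be re-joined on row_id.
    """
    partitions: Dict[str, List[Tuple]] = {name: [] for name in groups}
    for row_id, row in enumerate(rows):
        for name, cols in groups.items():
            partitions[name].append((row_id, *(row[c] for c in cols)))
    return partitions


def scan(
    partitions: Dict[str, List[Tuple]],
    groups: Dict[str, List[str]],
    group_name: str,
    columns: List[str],
) -> List[Tuple]:
    """Read only the requested columns from a single column group."""
    cols = groups[group_name]
    # +1 skips the leading row_id stored in every record.
    idx = [cols.index(c) + 1 for c in columns]
    return [tuple(rec[i] for i in idx) for rec in partitions[group_name]]


# Hypothetical TPC-H-like rows; column names are illustrative only.
orders = [
    {"orderkey": 1, "custkey": 10, "totalprice": 100.0, "comment": "a"},
    {"orderkey": 2, "custkey": 20, "totalprice": 250.5, "comment": "b"},
]
# Hand-picked grouping (an assumption, not a computed correlation group).
groups = {"keys": ["orderkey", "custkey"], "payload": ["totalprice", "comment"]}

parts = vertical_partition(orders, groups)
prices = scan(parts, groups, "payload", ["totalprice"])
```

In this sketch, a join on `orderkey` or `custkey` would read only the `keys` group, leaving the wider `payload` group untouched until the matching rows are known.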