Cheetah: a high performance, custom data warehouse on top of MapReduce

Authors:
Songting Chen
Affiliations:
Turn Inc.
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm called MapReduce, together with its open-source implementation Hadoop, has been widely adopted because of its impressive scalability and its flexibility in handling both structured and unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application, which allows various simplifications and custom optimizations. First, we take a fresh look at data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual view abstraction not only lets us design a query language that is SQL-like but much more succinct, it also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and the exploitation of materialized views. In fact, each node with commodity hardware in our cluster is able to process raw data at 1 GByte/s. Lastly, we show how to seamlessly integrate Cheetah into any ad-hoc MapReduce job. This allows MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies.
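
The virtual view idea in the abstract can be illustrated with a minimal sketch. It is not Cheetah's implementation or query language. The table contents, column names, and the helpers `virtual_view` and `total_by` are all hypothetical, chosen only to show how a flat view over a star schema lets a query name dimension attributes directly without writing joins.

```python
# Hypothetical toy star schema; all names and values are illustrative only.
impressions = [  # fact table: one row per ad impression
    {"advertiser_id": 1, "site_id": 10, "cost": 0.5},
    {"advertiser_id": 1, "site_id": 11, "cost": 0.25},
    {"advertiser_id": 2, "site_id": 10, "cost": 1.0},
]
advertisers = {  # dimension table keyed by advertiser_id
    1: {"advertiser_name": "Acme"},
    2: {"advertiser_name": "Globex"},
}
sites = {  # dimension table keyed by site_id
    10: {"site_name": "news.example"},
    11: {"site_name": "blog.example"},
}


def virtual_view(facts, dimensions):
    """Yield flat rows: each fact row extended with its dimension attributes.

    `dimensions` maps a foreign-key column name to a lookup table.
    Rows are produced lazily, so the joined view is never stored.
    """
    for fact in facts:
        row = dict(fact)
        for foreign_key, table in dimensions.items():
            row.update(table[fact[foreign_key]])
        yield row


def total_by(view, group_col, measure_col):
    """Sum `measure_col` per distinct value of `group_col` over the view."""
    totals = {}
    for row in view:
        key = row[group_col]
        totals[key] = totals.get(key, 0.0) + row[measure_col]
    return totals


# The query names only view columns; the joins are hidden inside the view.
view = virtual_view(impressions, {"advertiser_id": advertisers, "site_id": sites})
cost_by_advertiser = total_by(view, "advertiser_name", "cost")
print(cost_by_advertiser)
```

In this sketch the query refers to `advertiser_name` as if it were a column of the fact table, which is the kind of succinctness the abstract attributes to the virtual view abstraction.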