Cheetah: a high performance, custom data warehouse on top of MapReduce

  • Authors: Songting Chen
  • Affiliations: Turn Inc.
  • Venue: Proceedings of the VLDB Endowment
  • Year: 2010

Abstract

Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm called MapReduce, and its open-source implementation Hadoop, have been widely adopted because of their impressive scalability and their flexibility in handling structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application, which allows various simplifications and custom optimizations. First, we take a fresh look at data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual view abstraction not only allows us to design a SQL-like but much more succinct query language, but also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and the exploitation of materialized views. With these techniques, each commodity-hardware node in our cluster is able to process raw data at 1 GByte/s. Lastly, we show how to seamlessly integrate Cheetah into ad-hoc MapReduce jobs. This allows MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies.
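
The virtual view idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Cheetah's implementation or its query language. The table names, column names, and the helpers `virtual_view` and `total_by` are all hypothetical, chosen only to show how a view over a star schema lets a query name dimension attributes without writing any joins.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table plus two dimension tables.
impressions = [  # fact table (assumed columns)
    {"advertiser_id": 1, "site_id": 10, "cost": 0.5},
    {"advertiser_id": 1, "site_id": 11, "cost": 0.25},
    {"advertiser_id": 2, "site_id": 10, "cost": 1.0},
]
advertisers = {1: {"advertiser_name": "Acme"}, 2: {"advertiser_name": "Globex"}}
sites = {10: {"site_name": "news"}, 11: {"site_name": "sports"}}


def virtual_view():
    """Yield each fact row widened with its dimension attributes.

    The fact-to-dimension joins are fixed here, once, so callers never
    spell them out.
    """
    for row in impressions:
        yield {**row, **advertisers[row["advertiser_id"]], **sites[row["site_id"]]}


def total_by(column, measure):
    """Sum `measure` grouped by `column` over the virtual view."""
    totals = defaultdict(float)
    for row in virtual_view():
        totals[row[column]] += row[measure]
    return dict(totals)


# Queries refer to dimension attributes directly, with no join clause.
cost_by_advertiser = total_by("advertiser_name", "cost")
cost_by_site = total_by("site_name", "cost")
```

Because the joins live inside the view, a query reduces to a list of columns and an aggregate, which is one plausible reading of how such an abstraction keeps a query language succinct.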