Proceedings of the VLDB Endowment
Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm called MapReduce, together with its open-source implementation Hadoop, has been widely adopted because of its impressive scalability and its flexibility in handling both structured and unstructured data. In this paper, we describe Cheetah, our data warehouse system built on top of MapReduce. Cheetah is designed specifically for our online advertising application, which allows various simplifications and custom optimizations. First, we take a fresh look at data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual view abstraction not only allows us to design a query language that is SQL-like but much more succinct, but also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and the exploitation of materialized views. In our cluster, each node with commodity hardware is able to process raw data at 1 GB/s. Lastly, we show how to seamlessly integrate Cheetah into any ad-hoc MapReduce job. This allows MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies.
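To make the virtual view idea concrete, the following is a minimal sketch and not Cheetah's actual implementation or query language. It assumes a toy star schema with one fact table and one dimension table; every table, column, and function name here (`impressions`, `advertisers`, `virtual_view`, `sum_by`) is hypothetical. The point it illustrates is that once the fact-to-dimension join is hidden behind a view, a query can name dimension attributes directly without spelling out any join.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical toy star schema: one fact table and one dimension table.
impressions: List[dict] = [
    {"advertiser_id": 1, "clicks": 3},
    {"advertiser_id": 2, "clicks": 5},
    {"advertiser_id": 1, "clicks": 4},
]
advertisers: Dict[int, dict] = {
    1: {"advertiser_name": "acme"},
    2: {"advertiser_name": "globex"},
}


def virtual_view(facts: Iterable[dict],
                 dims_by_key: Dict[int, dict],
                 key: str) -> Iterator[dict]:
    """Lazily yield each fact row widened with its dimension attributes.

    Nothing is materialized: the join happens row by row as the view is read.
    """
    for row in facts:
        yield {**row, **dims_by_key[row[key]]}


def sum_by(rows: Iterable[dict], group_col: str, value_col: str) -> Dict[str, int]:
    """Group rows by one column and sum another (a stand-in for GROUP BY)."""
    totals: Dict[str, int] = {}
    for row in rows:
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[value_col]
    return totals


# The query refers only to view columns; the join is implicit in the view.
view = virtual_view(impressions, advertisers, "advertiser_id")
clicks_by_advertiser = sum_by(view, "advertiser_name", "clicks")
print(clicks_by_advertiser)
```

In this sketch the caller groups by `advertiser_name`, a dimension attribute, while reading rows that originate in the fact table, which is the kind of simplification a view over a star schema makes possible.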