Scalable queries for large datasets using cloud computing: a case study

Authors:
James P. McGlothlin;Latifur Khan
Affiliations:
The University of Texas at Dallas, Richardson, TX;The University of Texas at Dallas, Richardson, TX
Venue:
Proceedings of the 15th Symposium on International Database Engineering & Applications
Year:
2011

Citing 12
Cited 0

The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dynamo: amazon's highly available key-value store

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
The Chubby lock service for loosely-coupled distributed systems

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Bigtable: A Distributed Storage System for Structured Data

ACM Transactions on Computer Systems (TOCS)
A break in the clouds: towards a cloud definition

ACM SIGCOMM Computer Communication Review
Cassandra: structured storage system on a P2P network

Proceedings of the 28th ACM symposium on Principles of distributed computing
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
A view of cloud computing

Communications of the ACM
Benchmarking cloud serving systems with YCSB

Proceedings of the 1st ACM symposium on Cloud computing
Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing

IEEE Transactions on Knowledge and Data Engineering

Quantified Score

Hi-index	0.01

Visualization

Abstract

Cloud computing is rapidly growing in popularity as a solution for processing and retrieving huge amounts of data over clusters of inexpensive commodity hardware. The most common data model utilized by cloud computing software is the NoSQL data model. While this data model is extremely scalable, it is much more efficient for simple retrievals and scans than for the complex analytical queries typical in a relational database model. In this paper, we evaluate emerging cloud computing technologies using a representative use case. Our use case involves analyzing telecommunications logs for performance monitoring and quality assurance. Clearly, the size of such logs is growing exponentially as more devices communicate more frequently and the amount of data being transferred steadily increases. We analyze potential solutions to provide a scalable database which supports both retrieval and analysis. We will investigate and analyze all the major open source cloud computing solutions and designs. We then choose the most applicable subset of these technologies for experimentation. We provide a performance evaluation of these products, and we analyze our results and make recommendations. This paper provides a comprehensive survey of technologies for scalable data processing and an in-depth performance evaluation of these technologies.