Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis

Authors:
Elif Dede;Madhusudhan Govindaraju;Daniel Gunter;Richard Shane Canon;Lavanya Ramakrishnan
Affiliations:
Binghamton University, Binghamton, NY, USA;Binghamton University, Binghamton, NY, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA
Venue:
Proceedings of the 4th ACM workshop on Scientific cloud computing
Year:
2013

Abstract

Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. Both MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open-source implementation of MapReduce, has been evaluated, used, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project use MongoDB, a document-oriented NoSQL store. However, the performance trade-offs of using these two technologies together are not well understood. In this paper, we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, with the goal of identifying the right software environment for scientific data analysis.