Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis

  • Authors:
  • Elif Dede;Madhusudhan Govindaraju;Daniel Gunter;Richard Shane Canon;Lavanya Ramakrishnan

  • Affiliations:
  • Binghamton University, Binghamton, NY, USA;Binghamton University, Binghamton, NY, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA

  • Venue:
  • Proceedings of the 4th ACM workshop on Scientific cloud computing
  • Year:
  • 2013

Abstract

Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open-source implementation of MapReduce, has been evaluated, utilized, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project are using MongoDB, a document-oriented NoSQL store. However, there is a limited understanding of the performance trade-offs of using these two technologies together. In this paper we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, towards the goal of identifying the right software environment for scientific data analysis.