Migrating a (large) science database to the cloud

  • Authors:
  • Ani Thakar; Alex Szalay

  • Affiliations:
  • The Johns Hopkins University, Baltimore, MD; The Johns Hopkins University, Baltimore, MD

  • Venue:
  • Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
  • Year:
  • 2010

Abstract

We report on attempts to put an existing scientific (astronomical) database, the Sloan Digital Sky Survey (SDSS) science archive [1], in the cloud. Based on our experience, it is at this time either very frustrating or impossible to migrate an existing, complex SQL Server database into current cloud service offerings such as Amazon (EC2) and Microsoft (SQL Azure). Migrating a large database in excess of a TB is certainly impossible, but even with (much) smaller databases, the limitations of cloud services make it very difficult to migrate the data without changing the schema and settings (for example, we were unable to migrate a spatial indexing library and several other user-defined functions and stored procedures), and such changes would invalidate performance comparisons between cloud and on-premises versions. It is therefore not surprising that our preliminary performance comparisons show a very large (order-of-magnitude) performance discrepancy with the Amazon cloud version of the SDSS database. We have also not yet investigated the performance tweaks that could be possible within the cloud. Although we managed to migrate (a subset of) the SDSS catalog database to Amazon EC2, we were not able to access the database in a meaningful way from the outside world. Even though this was advertised as a public dataset on the AWS blog, it was not clear how other users or the public would be able to access the data in a meaningful way, if at all. These difficulties suggest that much work and coordination need to occur between cloud service providers and their potential database clients before science databases can be deployed successfully and effectively in the cloud. This is true not just for large scientific databases but for all databases that make extensive use of advanced database management system (DBMS) features for performance and user convenience.