Batch is Back: CasJobs, Serving Multi-TB Data on the Web

  • Authors:
  • William O'Mullane; Nolan Li; Maria Nieto-Santisteban; Alex Szalay; Ani Thakar

  • Affiliations:
  • Johns Hopkins University; Johns Hopkins University; Johns Hopkins University; Johns Hopkins University; Johns Hopkins University

  • Venue:
  • ICWS '05 Proceedings of the IEEE International Conference on Web Services
  • Year:
  • 2005

Abstract

The Sloan Digital Sky Survey (SDSS) science database describes over 230 million objects and is over 1.6 TB in size. The SDSS Catalog Archive Server (CAS) provides several levels of query interface to the SDSS data via the SkyServer website. Most queries execute in seconds or minutes. However, some queries can take hours or days, either because they require non-index scans of the largest tables, or because they request very large result sets, or because they represent very complex aggregations of the data. These "monster queries" not only take a long time, they also affect response times for everyone else: one or more of them can clog the entire system. To ameliorate this problem, we developed a multi-server multi-queue batch job submission, execution, and tracking system for the CAS called CasJobs. The transfer of very large result sets from queries over the network is another serious problem. Statistics suggested that much of this data transfer is unnecessary; users would prefer to store results locally in order to allow further joins and filtering. To allow local analysis, a system was developed that gives users their own personal databases (MyDB) at the server side. Users may transfer data to their MyDB, and then perform further analysis before extracting it to their own machine. MyDB tables also provide a convenient way to share results of queries with collaborators without downloading them. CasJobs is built using SOAP XML Web services and has been in operation since May 2004.
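
The multi-queue idea and the server-side MyDB described in the abstract can be illustrated with a small in-memory sketch. This is not the CasJobs implementation: the class and method names (BatchSystem, submit, run_next, status), the 60-second cutoff, and the two queue names are all assumptions made for illustration, and the `execute` callable stands in for a real database call.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical cutoff: jobs estimated to run shorter than this go to the quick queue.
QUICK_LIMIT_SECONDS = 60


@dataclass
class Job:
    job_id: int
    user: str
    query: str
    estimated_seconds: float
    target_table: str      # name of the MyDB table that receives the result
    status: str = "ready"


@dataclass
class BatchSystem:
    # Separate queues keep long-running queries from blocking short ones.
    queues: dict = field(default_factory=lambda: {"quick": deque(), "long": deque()})
    mydb: dict = field(default_factory=dict)   # user -> {table name -> rows}
    jobs: dict = field(default_factory=dict)   # job_id -> Job, used for tracking
    next_id: int = 1

    def submit(self, user, query, estimated_seconds, target_table):
        """Queue a job and return its id so the caller can track it."""
        job = Job(self.next_id, user, query, estimated_seconds, target_table)
        self.next_id += 1
        name = "quick" if estimated_seconds < QUICK_LIMIT_SECONDS else "long"
        self.queues[name].append(job)
        self.jobs[job.job_id] = job
        return job.job_id

    def run_next(self, queue_name, execute):
        """Run the oldest job in one queue; store its rows in the user's MyDB."""
        if not self.queues[queue_name]:
            return None
        job = self.queues[queue_name].popleft()
        job.status = "running"
        rows = execute(job.query)  # stand-in for the real database call
        self.mydb.setdefault(job.user, {})[job.target_table] = rows
        job.status = "finished"
        return job.job_id

    def status(self, job_id):
        return self.jobs[job_id].status


if __name__ == "__main__":
    system = BatchSystem()
    short_id = system.submit("alice", "SELECT TOP 10 objID FROM PhotoObj", 5, "sample")
    long_id = system.submit("alice", "SELECT objID FROM PhotoObj", 7200, "full_scan")
    system.run_next("quick", lambda query: [(1,), (2,)])
    print(system.status(short_id), system.status(long_id))
    print(sorted(system.mydb["alice"]))
```

In this sketch the result rows never leave the server side: they land in a per-user table, where a later query could join or filter them before any download takes place.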