Batch is Back: CasJobs, Serving Multi-TB Data on the Web

Authors:
William O'Mullane;Nolan Li;Maria Nieto-Santisteban;Alex Szalay;Ani Thakar
Affiliations:
Johns Hopkins University;Johns Hopkins University;Johns Hopkins University;Johns Hopkins University;Johns Hopkins University
Venue:
ICWS '05 Proceedings of the IEEE International Conference on Web Services
Year:
2005

Abstract

The Sloan Digital Sky Survey (SDSS) science database describes over 230 million objects and is over 1.6 TB in size. The SDSS Catalog Archive Server (CAS) provides several levels of query interface to the SDSS data via the SkyServer website. Most queries execute in seconds or minutes. However, some queries can take hours or days, either because they require non-index scans of the largest tables, or because they request very large result sets, or because they represent very complex aggregations of the data. These "monster queries" not only take a long time, they also affect response times for everyone else: one or more of them can clog the entire system. To ameliorate this problem, we developed a multi-server multi-queue batch job submission, execution, and tracking system for the CAS called CasJobs. The transfer of very large result sets from queries over the network is another serious problem. Statistics suggested that much of this data transfer is unnecessary; users would prefer to store results locally in order to allow further joins and filtering. To allow local analysis, a system was developed that gives users their own personal databases (MyDB) at the server side. Users may transfer data to their MyDB, and then perform further analysis before extracting it to their own machine. MyDB tables also provide a convenient way to share results of queries with collaborators without downloading them. CasJobs is built using SOAP XML Web services and has been in operation since May 2004.
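
The multi-queue idea in the abstract can be illustrated with a small in-memory sketch. This is not the CasJobs implementation: the class and method names, the per-job cost estimate, and the 60-second threshold between the "short" and "long" queues are all assumptions made for illustration. It only shows how routing long-running jobs to a separate queue keeps them from blocking quick ones, and how results can be kept in a per-user server-side store (standing in for MyDB) instead of being shipped over the network.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str
    query: str
    estimated_seconds: float  # hypothetical cost estimate, not from the paper
    target_table: str         # name of the result table in the user's store
    status: str = "queued"


class BatchScheduler:
    """Toy multi-queue scheduler; all names here are illustrative assumptions."""

    def __init__(self, short_limit_seconds=60.0):
        # Assumed threshold separating quick jobs from long-running ones.
        self.short_limit_seconds = short_limit_seconds
        self.queues = {"short": deque(), "long": deque()}
        self.jobs = {}   # job_id -> Job, used for tracking
        self.mydb = {}   # user -> {table_name: rows}, stand-in for MyDB
        self._next_id = 1

    def submit(self, user, query, estimated_seconds, target_table):
        """Queue a job and return its id so the caller can track it."""
        job = Job(self._next_id, user, query, estimated_seconds, target_table)
        self._next_id += 1
        name = "short" if estimated_seconds <= self.short_limit_seconds else "long"
        self.queues[name].append(job)
        self.jobs[job.job_id] = job
        return job.job_id

    def run_next(self, queue_name, execute):
        """Run the oldest job in one queue; store its rows server-side."""
        queue = self.queues[queue_name]
        if not queue:
            return None
        job = queue.popleft()
        job.status = "running"
        rows = execute(job.query)  # caller supplies the actual query executor
        self.mydb.setdefault(job.user, {})[job.target_table] = rows
        job.status = "finished"
        return job

    def status(self, job_id):
        return self.jobs[job_id].status
```

In this sketch a worker serving the "long" queue never delays jobs in the "short" queue, and a finished job's rows stay in `mydb[user][target_table]` until the user chooses to extract them.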