Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey

Authors:
Alexander S. Szalay; Peter Z. Kunszt; Ani Thakar; Jim Gray; Don Slutz; Robert J. Brunner
Affiliations:
Dept. of Physics and Astronomy, The Johns Hopkins University, Baltimore, MD; Dept. of Physics and Astronomy, The Johns Hopkins University, Baltimore, MD; Dept. of Physics and Astronomy, The Johns Hopkins University, Baltimore, MD; Microsoft Research, San Francisco, CA; Microsoft Research, San Francisco, CA; California Institute of Technology, Pasadena, CA
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000



Abstract

The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays through ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), is creating a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes in a 100+ dimensional space. Points in this space have highly correlated distributions.

The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial and attribute indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes will accelerate frequent searches. Splitting the data among multiple servers will allow parallel, scalable I/O and parallel data analysis. Hashing techniques will allow efficient clustering and pair-wise comparison algorithms that should parallelize nicely. Randomly sampled subsets will allow debugging otherwise large queries at the desktop. Central servers will operate a data pump to support sweep searches touching most of the data. The anticipated queries will require special operators related to angular distances and complex similarity tests of object properties, such as shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges.
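
As an illustration of the kind of angular-distance operator the abstract mentions, the following is a minimal sketch, not the SDSS implementation. It assumes each object carries a right ascension and declination in degrees, and it uses the haversine form of the great-circle distance, chosen here for numerical stability at small separations. The function names, the tuple layout, and the sample objects are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle angle in degrees between two sky positions (inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form: stays accurate when the two positions are very close.
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def within_radius(objects, ra0, dec0, radius_deg):
    """Return the (name, ra, dec) tuples lying within radius_deg of (ra0, dec0).

    This is a plain linear scan; a real archive would prune candidates
    with a spatial index before applying the exact distance test.
    """
    return [
        obj for obj in objects
        if angular_separation_deg(obj[1], obj[2], ra0, dec0) <= radius_deg
    ]


# Hypothetical sample objects: (name, ra_deg, dec_deg)
sample_objects = [
    ("a", 10.0, 20.0),
    ("b", 10.1, 20.0),
    ("c", 50.0, -5.0),
]
nearby = within_radius(sample_objects, 10.0, 20.0, 0.2)
```

In this sketch the exact test runs on every object. The spatial indices and tag objects described in the abstract would serve to shrink the candidate set before such a test is applied.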