Scalable, asynchronous, distributed eigen monitoring of astronomy data streams

  • Authors:
  • Kanishka Bhaduri;Kamalika Das;Kirk Borne;Chris Giannella;Tushar Mahule;Hillol Kargupta

  • Affiliations:
  • Mission Critical Technologies Inc., NASA Ames Research Center, MS 269-1, Moffett Field, CA 94035, USA;Stinger Ghaffarian Technologies Inc., NASA Ames Research Center, MS 269-3, Moffett Field, CA 94035, USA;Computational and Data Sciences Department, GMU, VA 22030, USA;The MITRE Corporation, 300 Sentinel Dr. Suite 600, Annapolis Junction MD 20701, USA;CSEE Department, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, USA;CSEE Department, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, USA and AGNIK, LLC., 8840 Stanford Blvd., Suite 1300 Columbia, MD 21045, USA

  • Venue:
  • Statistical Analysis and Data Mining
  • Year:
  • 2011

Abstract

In this paper, we develop a distributed algorithm for monitoring the principal components (PCs) for the next generation of petascale astronomy data pipelines, such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeated images of the night sky every 20 s, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Event detection, classification, and isolation in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks on high-throughput distributed data streams is a challenging problem. We propose a highly scalable, distributed, asynchronous algorithm for monitoring the PCs of such dynamic data streams and discuss a prototype web-based system, PADMINI (Peer-to-Peer Astronomy Data Mining), which implements this algorithm for use by astronomers. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., it converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Experiments performed on real Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2011 (A shorter version of this paper was published in SIAM Data Mining Conference 2009.)
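
The idea of monitoring PCs without centralizing data can be illustrated with a minimal sketch. It is not the algorithm from the paper: the asynchronous peer-to-peer protocol is replaced here by a plain sum over nodes, and the function names, the tolerance `eps`, and the assumption that the data are already mean-centered are all hypothetical choices made for illustration. Each node reduces its local rows to a single d-dimensional vector, and the combined vector equals C v − λ v for the global covariance C, so a small norm suggests the current eigenpair (v, λ) is still valid.

```python
import math

def local_residual(rows, v, lam):
    """One node's contribution: sum over its rows of x*(x.v) - lam*v, plus its row count.

    Assumes the rows are already mean-centered (an assumption of this sketch).
    """
    d = len(v)
    acc = [0.0] * d
    for x in rows:
        proj = sum(xi * vi for xi, vi in zip(x, v))  # x . v
        for j in range(d):
            acc[j] += x[j] * proj - lam * v[j]
    return acc, len(rows)

def eigen_pair_still_valid(nodes, v, lam, eps):
    """Check ||C v - lam v|| <= eps, where C is the covariance of all rows on all nodes.

    Only d-dimensional summaries are combined; raw rows never leave local_residual.
    The plain loop below stands in for a real distributed aggregation step.
    """
    d = len(v)
    total = [0.0] * d
    n = 0
    for rows in nodes:
        acc, count = local_residual(rows, v, lam)
        total = [t + a for t, a in zip(total, acc)]
        n += count
    residual = math.sqrt(sum((t / n) ** 2 for t in total))
    return residual <= eps

# Two nodes whose combined covariance is diag(2.5, 0).
nodes = [
    [[1.0, 0.0], [-1.0, 0.0]],
    [[2.0, 0.0], [-2.0, 0.0]],
]
v, lam = [1.0, 0.0], 2.5
valid_before = eigen_pair_still_valid(nodes, v, lam, eps=0.1)

# A new node joins with data along the second axis, changing the covariance.
nodes.append([[0.0, 3.0], [0.0, -3.0]])
valid_after = eigen_pair_still_valid(nodes, v, lam, eps=0.1)
```

In this toy setup the check passes before the third node joins and fails afterwards, which is the point at which a monitoring scheme of this kind would trigger a recomputation of the PCs.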