Parallel and Distributed Astronomical Data Analysis on Grid Datafarm

Authors:
Naotaka Yamamoto;Osamu Tatebe;Satoshi Sekiguchi
Affiliations:
Grid Technology Research Center, AIST, Japan;Grid Technology Research Center, AIST, Japan;Grid Technology Research Center, AIST, Japan
Venue:
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Year:
2004

Abstract

A comprehensive study of the entire petabyte-scale archive of astronomical observatories could yield new science and new knowledge in the field, but it has not been feasible so far for lack of an adequate data analysis environment. The Grid Datafarm architecture is designed for global petabyte-scale data-intensive computing. It provides a Grid file system with file replica management for fault tolerance and load balancing, together with parallel and distributed computing support for sets of files, and so meets the requirements of such a comprehensive study. In this paper, we discuss worldwide parallel and distributed data analysis in observational astronomy. The archival data is stored, replicated, and dispersed in a Gfarm file system. All the astronomical data analysis tools successfully access files in the Gfarm file system without any code modification, using a syscall hooking library, regardless of file replica locations. Performance evaluation of the parallel data analysis, conducted in several ways, shows that file-affinity process scheduling plays an essential role in scalable and efficient parallel file I/O. A data calibration tool shows scalable file I/O performance, achieving 5.9 GB/sec for reading and 4.0 GB/sec for writing FITS files using 30 cluster nodes (60 CPUs). On-demand file replica creation mitigates the overhead of access concentration: another tool shows a sixfold performance improvement in reading a shared file once file replicas are created.
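
The idea behind file-affinity process scheduling can be illustrated with a minimal sketch. It is not the Gfarm scheduler itself: the function name `schedule_by_file_affinity`, the replica map, and the host names are hypothetical, and the load model (one unit per assigned process) is an assumption made for illustration. Each process is placed on a host that already stores a replica of its input file, so that reads stay on local disk; only when no replica host is available does the sketch fall back to the least-loaded host overall.

```python
def schedule_by_file_affinity(replicas, hosts):
    """Assign one process per file, preferring hosts that hold a replica.

    replicas: dict mapping file name -> list of hosts storing a replica
              (hypothetical input, not a Gfarm metadata format)
    hosts:    list of hosts available for computation
    """
    load = {h: 0 for h in hosts}          # processes assigned so far per host
    assignment = {}
    for name in sorted(replicas):         # deterministic order for the example
        # Hosts that both store a replica and are available for computation.
        local = [h for h in replicas[name] if h in load]
        # Fall back to any host if the file has no usable replica host.
        candidates = local or hosts
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda h: (load[h], h))
        assignment[name] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    replicas = {
        "a.fits": ["node1"],
        "b.fits": ["node1", "node2"],
        "c.fits": ["node3"],
        "d.fits": [],                     # no replica recorded anywhere
    }
    print(schedule_by_file_affinity(replicas, ["node1", "node2", "node3"]))
```

In this toy input, `b.fits` is sent to `node2` rather than `node1` because both hold a replica and `node1` is already busy, which also hints at why creating additional replicas of a heavily shared file spreads the read load.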