Within the scientific community, many high-performance applications run experiments on data sets that can be very large in size or in number. Both situations strain the centralized manager that schedules the system. In our approach we minimize the manager's role by using a distributed hash table: every file has a designated "home" location where it resides when in use, which reduces location maintenance. We further reduce the load on the central manager with a counting Bloom filter, which lets the manager quickly and easily check whether a given data set exists in the system without the storage cost of a full database. By adding false-positive detection, in the form of locality checks, to the Bloom filter, we reduce the probability that a false positive causes a problem in our system. In this fashion we move from a centralized manager toward a distributed one while limiting the additional metadata the manager must maintain. With our approach we show that the workload of the centralized manager directory is reduced significantly. Not only does the number of files that must be migrated into the system drop by up to nearly 90% compared to centralized storage-aware scheduling, but it is also possible to achieve higher hit rates than a standard cache when job placement is influenced by data location. We also show that a significant fraction of false positives (at least 25%) can be detected from the Bloom filter, at the cost of allowing false negatives to occur at a modestly increased rate.
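The two mechanisms the abstract relies on can be sketched briefly. Below is a minimal Python illustration (not the paper's implementation): a counting Bloom filter, which supports removal because each slot holds a counter rather than a single bit, and a hypothetical hash-based "home" assignment standing in for the DHT lookup. The class name, slot counts, and `home_node` helper are all assumptions for illustration.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: like a standard Bloom filter, but each
    slot holds a counter, so previously added keys can be removed."""

    def __init__(self, num_slots=1024, num_hashes=4):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots

    def _slots(self, key):
        # Derive k slot indices from k independently salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_slots

    def add(self, key):
        for s in self._slots(key):
            self.counters[s] += 1

    def remove(self, key):
        # Only safe for keys that were actually added; removing a key
        # never added can introduce false negatives, the trade-off
        # the abstract accepts.
        for s in self._slots(key):
            if self.counters[s] > 0:
                self.counters[s] -= 1

    def might_contain(self, key):
        # May return True for absent keys (false positive); a locality
        # check at the key's home node can catch such cases.
        return all(self.counters[s] > 0 for s in self._slots(key))


def home_node(filename, nodes):
    """Hypothetical DHT-style placement: hash the file name onto a
    fixed node list, so no central directory tracks file locations."""
    digest = hashlib.sha256(filename.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]
```

In this sketch, the manager queries `might_contain` before scheduling, and a positive answer is verified by checking the file's `home_node`, mirroring the locality check described above.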