Scalable and Distributed Processing of Scientific XML Data

Authors:
Elif Dede; Zacharia Fadika; Chaitali Gupta; Madhusudhan Govindaraju
Venue:
GRID '11 Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing
Year:
2011

Citing 9

Index Structures for Path Expressions
ICDT '99 Proceedings of the 7th International Conference on Database Theory

D(k)-index: an adaptive structural summary for graph-structured data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008

Improving Performance of Web Services Query Matchmaking with Automated Knowledge Acquisition
WI '07 Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence

Semantic Framework for Free-Form Search of Grid Resources
ESCIENCE '08 Proceedings of the 2008 Fourth IEEE International Conference on eScience

Spyglass: fast, scalable metadata search for large-scale storage systems
FAST '09 Proceedings of the 7th conference on File and storage technologies

Experiences on Processing Spatial Data with MapReduce
SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management

Using Index in the MapReduce Framework
APWEB '10 Proceedings of the 2010 12th International Asia-Pacific Web Conference

LEMO-MR: Low Overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications
CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science

Cited 1

Experiment explorer: lightweight provenance search over metadata
TaPP'12 Proceedings of the 4th USENIX conference on Theory and Practice of Provenance

Abstract

A seamless and intuitive search capability for the vast number of datasets generated by scientific experiments is critical to ensuring effective use of such data by domain-specific scientists. Currently, searches on enormous XML datasets are done manually via custom scripts or with hard-to-customize queries that experts write in complex and disparate XML query languages. Such approaches, however, do not provide acceptable performance for large-scale data because they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, as term kinship, range-based queries, and constraints such as conjunction and negation need to be taken into account. There is a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. Because most existing search tools are designed for exact string matching, which is not adequate for scientific needs, we believe that such a framework will enhance the productivity and quality of scientific research through the data reduction capabilities it can provide. This paper presents how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by the Spallation Neutron Source (SNS) scientists. We present an indexing structure that scales well for large-scale MapReduce processing. We present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.
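
To make the idea of MapReduce-style XML metadata indexing concrete, the following is a minimal sketch in Python. It is not the indexing structure the paper describes, and it does not use Apache Hadoop or LEMO-MR; it only simulates a map phase and a reduce phase in a single process. The element names, document identifiers, and the functions map_document, reduce_pairs, build_index, and range_query are all hypothetical, chosen for illustration. The map phase emits a ((path, value), document id) pair for each leaf element, and the reduce phase groups document ids by key, which is enough to answer a simple range-based query.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_document(doc_id, xml_text):
    """Map phase: emit ((path, value), doc_id) for every leaf element with text."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        text = (elem.text or "").strip()
        if text and len(elem) == 0:  # leaf element carrying a value
            pairs.append(((path, text), doc_id))
        for child in elem:
            walk(child, path)

    walk(root, "")
    return pairs


def reduce_pairs(pairs):
    """Shuffle and reduce phase: collect the document ids seen for each key."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    return {key: sorted(ids) for key, ids in grouped.items()}


def build_index(docs):
    """Run the map step over every document, then reduce the combined output."""
    pairs = []
    for doc_id, xml_text in docs.items():
        pairs.extend(map_document(doc_id, xml_text))
    return reduce_pairs(pairs)


def range_query(index, path, low, high):
    """Return ids of documents whose numeric value at `path` lies in [low, high]."""
    hits = set()
    for (indexed_path, value), doc_ids in index.items():
        if indexed_path != path:
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # skip non-numeric values stored under this path
        if low <= number <= high:
            hits.update(doc_ids)
    return sorted(hits)


# Made-up metadata records; real NeXus or TeraGrid metadata is far richer.
DOCS = {
    "run1": "<entry><instrument>alpha</instrument><temperature>300</temperature></entry>",
    "run2": "<entry><instrument>alpha</instrument><temperature>12.5</temperature></entry>",
    "run3": "<entry><instrument>beta</instrument><temperature>150</temperature></entry>",
}

index = build_index(DOCS)
print(index[("/entry/instrument", "alpha")])
print(range_query(index, "/entry/temperature", 100, 400))
```

In a distributed setting the map calls would run in parallel over partitions of the documents and the grouping would be handled by the framework's shuffle step; the linear scan in range_query is only a placeholder for whatever lookup structure a real index would provide.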