CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
A seamless and intuitive search capability for the vast number of datasets generated by scientific experiments is critical to ensuring that domain scientists can use such data effectively. Currently, searches on enormous XML datasets are done manually via custom scripts or with hard-to-customize queries that experts write in complex and disparate XML query languages. Such approaches do not provide acceptable performance for large-scale data because they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, since term kinship, range-based queries, and constraints such as conjunction and negation need to be taken into account. There is a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. Most existing search tools are designed for exact string matching, which is not adequate for scientific needs, so we believe that such a framework will enhance the productivity and quality of scientific research through the data reduction capabilities it can provide. This paper presents how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by scientists at the Spallation Neutron Source (SNS). We present an indexing structure that scales well for large-scale MapReduce processing. We present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.
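To make the idea of MapReduce-based XML metadata indexing concrete, the following is a minimal single-process sketch in Python. It is not the indexing structure presented in the paper, and it uses neither Hadoop nor LEMO-MR; it only illustrates the general map, shuffle, and reduce pattern for building an inverted index keyed by element path and value. The function names (`map_document`, `shuffle`, `reduce_postings`, `build_index`) and the sample documents are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_document(doc_id, xml_text):
    """Map phase: emit ((element_path, text_value), doc_id) pairs.

    Hypothetical helper for illustration; one pair is emitted for every
    element that carries non-empty text.
    """
    root = ET.fromstring(xml_text)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            yield (path, text), doc_id
        for child in node:
            yield from walk(child, path)

    yield from walk(root, "")


def shuffle(pairs):
    """Shuffle phase: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(key, doc_ids):
    """Reduce phase: collapse the values for one key into a sorted posting list."""
    return key, sorted(set(doc_ids))


def build_index(documents):
    """Run map, shuffle, and reduce in-process over {doc_id: xml_text}."""
    pairs = [
        pair
        for doc_id, xml_text in documents.items()
        for pair in map_document(doc_id, xml_text)
    ]
    return dict(reduce_postings(k, v) for k, v in shuffle(pairs).items())


# Made-up sample metadata, not taken from TeraGrid or NeXus files.
documents = {
    "run1": "<entry><instrument>INSTR_A</instrument><temperature>300</temperature></entry>",
    "run2": "<entry><instrument>INSTR_A</instrument></entry>",
}
index = build_index(documents)
```

In this sketch, a lookup such as `index[("/entry/instrument", "INSTR_A")]` returns the identifiers of the documents that contain that path and value. Range-based queries, term kinship, and negation, which the abstract names as requirements, would need additional structure beyond this exact-match index.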