The MapReduce framework is being extended to domains quite different from the web applications for which it was designed, including the processing of big structured data such as scientific and financial data. Previous work using MapReduce to process scientific data ignores existing structure when assigning intermediate data and scheduling tasks. In this paper, we present a method for incorporating knowledge of the structure of scientific data, and of the executing query, into the MapReduce communication model. Implemented in SciHadoop, a version of the Hadoop MapReduce framework for scientific data, SIDR intelligently partitions and routes intermediate data, allowing it to: remove Hadoop's global barrier and execute Reduce tasks before all Map tasks have completed; minimize intermediate key skew; and produce early, correct results. SIDR executes queries up to 2.5 times faster than Hadoop and 37% faster than SciHadoop; produces initial results with only 6% of the query completed; and produces dense, contiguous output.
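The abstract does not describe SIDR's partitioning scheme in detail, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes a one-dimensional input array, Map tasks that each read one contiguous split, and a downsampling query in which each output key covers a fixed window of inputs. All function names and parameters here are hypothetical. Under those assumptions, assigning contiguous ranges of output keys to each Reduce task makes that task depend on only a few known Map tasks, so it can start as soon as those finish instead of waiting at a global barrier.

```python
from math import ceil

def contiguous_partition(key, num_keys, num_reducers):
    # Structure-aware (assumed scheme): each reducer owns one contiguous key range.
    keys_per_reducer = ceil(num_keys / num_reducers)
    return key // keys_per_reducer

def modulo_partition(key, num_keys, num_reducers):
    # Structure-oblivious baseline: keys are spread round-robin over reducers.
    return key % num_reducers

def reducer_dependencies(partition, n_inputs, split_size, window, num_reducers):
    # For each reducer, collect the map tasks that produce data routed to it.
    num_keys = ceil(n_inputs / window)
    deps = {r: set() for r in range(num_reducers)}
    for i in range(n_inputs):
        map_task = i // split_size   # which input split holds element i
        out_key = i // window        # which output key element i contributes to
        deps[partition(out_key, num_keys, num_reducers)].add(map_task)
    return deps

def ready_reducers(deps, completed_maps):
    # Reducers whose every upstream map task has already finished.
    return sorted(r for r, needed in deps.items() if needed <= completed_maps)

# Toy query: 64 inputs, 8 map tasks (splits of 8), windows of 4, 4 reducers.
structured = reducer_dependencies(contiguous_partition, 64, 8, 4, 4)
scattered = reducer_dependencies(modulo_partition, 64, 8, 4, 4)

# After only the first two map tasks complete:
early_structured = ready_reducers(structured, {0, 1})
early_scattered = ready_reducers(scattered, {0, 1})
```

In this toy setup the contiguous scheme lets one Reduce task begin after two of the eight Map tasks have finished, while the modulo scheme leaves every Reduce task waiting on Map tasks spread across the whole input. Equal-sized contiguous ranges also give each Reduce task the same number of keys, which is one simple way to think about limiting key skew.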