Trends and outlook for the massive-scale analytics stack

Authors:
A. N. Ghoting;J. A. Gunnels;P. Kambadur;E. P. Pednault;M. S. Squillante
Affiliations:
IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY (all authors)
Venue:
IBM Journal of Research and Development
Year:
2013

Abstract

Massive-scale analytics (MSA) applications are characterized by the large amount of data that they process and the complexity of algorithms used to process the data. The ideal MSA system will not only support processing of large amounts of data but also offer a high degree of parallelism and support scheduling and resource allocation of complex workloads. Designers of MSA systems must provide three necessities: programming abstractions, runtime systems, and hardware. Historically, two communities have undertaken the task of designing MSA systems: the database community, which has argued for an SQL (Structured Query Language)-influenced processing paradigm, and the high-performance computing community, which has focused on developing infrastructures for highly efficient, but complex, parallel implementations. These two communities have developed disparate technologies to meet the necessities of MSA systems, and the solutions provided by the individual communities are not completely satisfactory. In this paper, we attempt to characterize the strengths and weaknesses of the approaches of these two communities at all levels of the MSA stack, characterize implications with respect to resource management within the MSA system, and define how an MSA system should be designed.