The burgeoning field of data science benefits from applying a variety of analytic models and techniques to the oft-cited problems of large data volume, high data velocity, and significant variety in data structure and semantics. Many approaches use common analytic techniques in either a streaming or a batch processing paradigm. This paper presents progress in developing a framework for analyzing large-scale datasets that uses both pools of techniques in a unified manner. The framework includes: (1) a Domain-Specific Language (DSL) for describing analyses as a set of Communicating Sequential Processes, fully integrated with the Java type system, together with an Integrated Development Environment (IDE) and a compiler that builds idiomatic Java; (2) a runtime model for executing an analytic in both streaming and batch environments; and (3) a novel approach to automated management of cell-level security labels, applied uniformly across all runtimes. The paper concludes with a demonstration of the system on a sample workload developed with the DSL in (1), and an analysis of the performance characteristics of each of the runtimes described in (2).
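As a rough illustration of the idea of running one analytic under both a batch and a streaming runtime while carrying security labels along with each value, the following Java sketch may help. It is not the paper's DSL, compiler output, or runtime API: every name in it (`UnifiedAnalyticSketch`, `Labeled`, `lift`, `runBatch`, `runStreaming`) is hypothetical, the label is assumed to be a plain string copied from input to output, and the streaming side is assumed to be a single process reading from and writing to blocking queues, with an empty `Optional` as the end-of-stream marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the paper.
final class UnifiedAnalyticSketch {

    /** A value paired with a cell-level security label (assumed to be a string). */
    static final class Labeled<T> {
        final T value;
        final String label;
        Labeled(T value, String label) { this.value = value; this.label = label; }
        @Override public String toString() { return value + " [" + label + "]"; }
    }

    /** Lifts a plain function so the input's label is copied to the output. */
    static <A, B> Function<Labeled<A>, Labeled<B>> lift(Function<A, B> f) {
        return in -> new Labeled<>(f.apply(in.value), in.label);
    }

    /** Batch runtime: applies the stage to every element of a finite input. */
    static <A, B> List<Labeled<B>> runBatch(
            List<Labeled<A>> input, Function<Labeled<A>, Labeled<B>> stage) {
        List<Labeled<B>> out = new ArrayList<>();
        for (Labeled<A> item : input) {
            out.add(stage.apply(item));
        }
        return out;
    }

    /**
     * Streaming runtime: one process that reads from an input channel and
     * writes to an output channel until it sees the end-of-stream marker.
     */
    static <A, B> Thread runStreaming(
            BlockingQueue<Optional<Labeled<A>>> in,
            BlockingQueue<Optional<Labeled<B>>> out,
            Function<Labeled<A>, Labeled<B>> stage) {
        Thread process = new Thread(() -> {
            try {
                while (true) {
                    Optional<Labeled<A>> msg = in.take();
                    if (!msg.isPresent()) {
                        out.put(Optional.empty()); // forward end-of-stream
                        return;
                    }
                    out.put(Optional.of(stage.apply(msg.get())));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        process.start();
        return process;
    }

    public static void main(String[] args) throws InterruptedException {
        // The same stage definition is handed to both runtimes.
        Function<Labeled<String>, Labeled<Integer>> length = lift(String::length);

        List<Labeled<String>> data = Arrays.asList(
                new Labeled<String>("alpha", "PUBLIC"),
                new Labeled<String>("bravo-1", "RESTRICTED"));
        System.out.println("batch:  " + runBatch(data, length));

        BlockingQueue<Optional<Labeled<String>>> in = new LinkedBlockingQueue<>();
        BlockingQueue<Optional<Labeled<Integer>>> out = new LinkedBlockingQueue<>();
        Thread process = runStreaming(in, out, length);
        for (Labeled<String> item : data) {
            in.put(Optional.of(item));
        }
        in.put(Optional.empty());
        process.join();

        Optional<Labeled<Integer>> msg;
        while ((msg = out.take()).isPresent()) {
            System.out.println("stream: " + msg.get());
        }
    }
}
```

The point of the sketch is only that the analytic itself (`length`) is written once, and that label handling lives in `lift` rather than in either runtime, so both execution modes treat labels identically. A real system would need a policy for combining labels when several inputs contribute to one output, which this example does not attempt.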