Applications based on event processing are often designed to continuously evaluate sets of events defined by sliding time windows. Solutions that execute long-running continuous queries in memory show their limits in applications characterized by rapid growth in the number of available sources that continuously produce new events at high rates (e.g., intrusion detection systems and algorithmic trading). Problems arise from the complexity of maintaining large volumes of events in memory for continuous elaboration, and from the difficulty of managing the network of processing nodes at run time. A batch approach to this kind of computation provides a viable solution for scenarios characterized by infrequent computations over very large time windows. In this paper we propose a model for batch processing of time-window event computations that allows the definition of multiple metrics for performance optimization. These metrics specifically take into account the organization of input data in order to minimize its impact on computation latency. The model is then instantiated on Hadoop, a batch processing engine based on the MapReduce paradigm, and a set of strategies for efficiently arranging input data is described and evaluated.
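To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of how a time-window event computation can be cast as a MapReduce-style batch job: the map phase replicates each event into every sliding window that covers its timestamp, and the reduce phase groups events per window for elaboration. The window size, slide, and all function names here are illustrative assumptions.

```python
# Hypothetical sketch: sliding time windows as a MapReduce-style batch job.
# WINDOW_SIZE and SLIDE are assumed parameters, not values from the paper.
from collections import defaultdict

WINDOW_SIZE = 60   # seconds covered by each window
SLIDE = 30         # seconds between consecutive window start times

def map_event(timestamp, payload):
    """Map phase: emit (window_start, payload) for every window
    whose range [start, start + WINDOW_SIZE) contains the event."""
    # Earliest window start that can still include this timestamp.
    first = ((timestamp - WINDOW_SIZE) // SLIDE + 1) * SLIDE
    first = max(first, 0)
    for start in range(first, timestamp + 1, SLIDE):
        if start <= timestamp < start + WINDOW_SIZE:
            yield start, payload

def reduce_windows(pairs):
    """Reduce phase: group payloads by window start,
    mimicking MapReduce's shuffle-and-reduce step."""
    windows = defaultdict(list)
    for start, payload in pairs:
        windows[start].append(payload)
    return dict(windows)

# Toy input: (timestamp_in_seconds, event_payload)
events = [(5, "a"), (35, "b"), (65, "c")]
pairs = [kv for ts, p in events for kv in map_event(ts, p)]
result = reduce_windows(pairs)
# Event "b" at t=35 falls into two overlapping windows (starts 0 and 30).
```

In a real Hadoop job the shuffle would partition the `(window_start, payload)` pairs across reducers, which is exactly where the input-organization strategies discussed in the paper matter: how events are laid out in the input files determines how much data each window computation must read and move.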