Scalable splitting of massive data streams

Authors:
Erik Zeitler;Tore Risch
Affiliations:
Department of Information Technology, Uppsala University, Sweden;Department of Information Technology, Uppsala University, Sweden
Venue:
DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part II
Year:
2010

Citing 20
Cited 4

Principles of distributed database systems (2nd ed.)

Principles of distributed database systems (2nd ed.)
Gigascope: a stream database for network applications

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Dynamic Load Distribution in the Borealis Stream Processor

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Customizable parallel execution of scientific stream queries

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Processing High-Volume Stream Queries on a Supercomputer

ICDEW '06 Proceedings of the 22nd International Conference on Data Engineering Workshops
Run-time operator state spilling for memory intensive long-running queries

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Design, implementation, and evaluation of the linear road benchmark on the stream processing core

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Contract-based load management in federated distributed systems

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using stream queries to measure communication performance of a parallel computing environment

ICDCSW '07 Proceedings of the 27th International Conference on Distributed Computing Systems Workshops
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Linear road: a stream data management benchmark

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Highly scalable trip grouping for large-scale collective transportation systems

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Query-aware partitioning for monitoring massive network data streams

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Toward massive query optimization in large-scale distributed stream systems

Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
XStream: a Signal-Oriented Data Stream Management System

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Thread cooperation in multicore architectures for frequency counting over multiple data streams

Proceedings of the VLDB Endowment
Efficient dynamic operator placement in a locally distributed continuous query system

ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part I

Virtualizing stream processing

Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
Virtualizing stream processing

Proceedings of the 12th International Middleware Conference
Efficient ESL-Event-to-SQL translation

IScIDE'12 Proceedings of the third Sino-foreign-interchange conference on Intelligent Science and Intelligent Data Engineering
A performance analysis of System S, S4, and Esper via two level benchmarking

QEST'13 Proceedings of the 10th international conference on Quantitative Evaluation of Systems


Abstract

Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams, over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, because the optimal parallelization may depend on application semantics. To enable application-specific stream splitting, we introduce splitstream functions, in which the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. We introduce a cost model that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions and relate experimental results to cost model estimates. Based on the results, we propose a splitstream function called autosplit, which scales well for high degrees of parallelism and is robust to varying proportions of stream partitioning and replication. We show how user-defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.
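
The idea of splitting a stream by a combination of partitioning and replication can be illustrated with a minimal, single-process sketch. This is not the authors' implementation or API: the function name `split_stream` and the parameters `route_fn` and `broadcast_fn` are hypothetical names chosen for illustration, and the sketch ignores the parallel execution and cost considerations that the paper is about. It only shows how a user-supplied predicate could decide which tuples are replicated to every sub-stream while the remaining tuples are routed to exactly one.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def split_stream(
    stream: Iterable[T],
    q: int,
    route_fn: Callable[[T], int],
    broadcast_fn: Callable[[T], bool],
) -> List[List[T]]:
    """Split `stream` into `q` sub-streams (illustrative sketch only).

    Tuples for which `broadcast_fn` is true are replicated to all
    sub-streams; every other tuple is partitioned, i.e. sent to the
    single sub-stream with index `route_fn(tuple) % q`.
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    outputs: List[List[T]] = [[] for _ in range(q)]
    for tup in stream:
        if broadcast_fn(tup):
            # Replication: every sub-stream receives a copy.
            for out in outputs:
                out.append(tup)
        else:
            # Partitioning: exactly one sub-stream receives the tuple.
            outputs[route_fn(tup) % q].append(tup)
    return outputs


# Hypothetical input: (kind, key, payload) tuples. "cfg" tuples are
# replicated, all other tuples are partitioned by their integer key.
events = [("pos", 0, "a"), ("pos", 1, "b"), ("cfg", 0, "c"), ("pos", 2, "d")]
subs = split_stream(
    events,
    q=2,
    route_fn=lambda t: t[1],
    broadcast_fn=lambda t: t[0] == "cfg",
)
```

In this sketch the proportion of replicated tuples directly determines how much work the splitter does per input tuple, which gives some intuition for why the splitting step itself can become the bottleneck at high stream volumes.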