Query-aware partitioning for monitoring massive network data streams

  • Authors: Theodore Johnson; Muthu S. Muthukrishnan; Vladislav Shkapenyuk; Oliver Spatscheck
  • Affiliations: AT&T Labs - Research, Florham Park, NJ, USA; Rutgers University, Piscataway, NJ, USA; AT&T Labs - Research, Florham Park, NJ, USA; AT&T Labs - Research, Florham Park, NJ, USA
  • Venue: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data
  • Year: 2008

Abstract

Data Stream Management Systems (DSMS) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently exceeds by far the computational capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of new OC-768 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning of the query execution plans and partitioning of the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load compared to a centralized system, and can even increase it. In this paper we present an alternative approach, query-aware data stream partitioning, that allows for more efficient scaling. We present methods for analyzing any given query set and choosing an optimal partitioning scheme, and show how to reconcile the potentially conflicting requirements that different queries may place on the partitioning. We conclude with experiments on a small cluster of processing nodes over a high-rate network traffic feed, which demonstrate for several query sets that our methods effectively distribute the load across all processing nodes and facilitate efficient scaling whenever more processing nodes become available.
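To make the idea concrete, the following is a minimal illustrative sketch (not taken from the paper; the query set, attribute names, and the simple "most common group-by attribute" heuristic are all assumptions) of query-aware partitioning: inspect each query's grouping attributes, pick a partitioning field shared by as many queries as possible, and route tuples to nodes by hashing that field so that all tuples belonging to the same group are aggregated on the same node.

```python
import hashlib

# Hypothetical query set: each query aggregates over some grouping attributes.
QUERIES = {
    "flows_per_src":    {"group_by": {"srcIP"}},
    "bytes_per_pair":   {"group_by": {"srcIP", "dstIP"}},
    "pkts_per_dstport": {"group_by": {"dstPort"}},
}

def choose_partition_field(queries):
    """Pick the attribute appearing in the most group-by sets.

    Queries whose group-by does not include the chosen field would need a
    final re-aggregation across nodes; reconciling such conflicts is what
    the paper's analysis addresses in a far more principled way.
    """
    counts = {}
    for q in queries.values():
        for attr in q["group_by"]:
            counts[attr] = counts.get(attr, 0) + 1
    return max(counts, key=counts.get)

def route(tup, field, num_nodes):
    """Hash the partitioning field so every tuple of a group hits one node."""
    digest = hashlib.md5(str(tup[field]).encode()).hexdigest()
    return int(digest, 16) % num_nodes

if __name__ == "__main__":
    field = choose_partition_field(QUERIES)  # e.g. 'srcIP'
    pkt = {"srcIP": "10.0.0.1", "dstIP": "10.0.0.2", "dstPort": 443}
    print(field, "-> node", route(pkt, field, num_nodes=4))
```

Because routing depends only on the hashed field, each node sees a disjoint set of groups and can run the queries locally without cross-node state, which is the property that lets the load scale with the number of nodes.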