A new operator for efficient stream-relation join processing in data streaming engines

Authors:
Roozbeh Derakhshan;Abdul Sattar;Bela Stantic
Affiliations:
Griffith University, Brisbane, Australia;Griffith University, Brisbane, Australia;Griffith University, Brisbane, Australia
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Citing 5
Cited 0

Exploiting k-constraints to reduce memory overhead in continuous queries over data streams
ACM Transactions on Database Systems (TODS)

Out-of-order processing: a new architecture for high-performance stream systems
Proceedings of the VLDB Endowment

A partition-based approach to support streaming updates over persistent data in an active datawarehouse
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing

SECRET: a model for analysis of the execution semantics of stream processing systems
Proceedings of the VLDB Endowment

Semi-Streamed Index Join for near-real time execution of ETL transformations
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering

Abstract

In the last decade, Stream Processing Engines (SPEs) have emerged as a new processing paradigm that can process huge amounts of data while retaining low latency and high throughput. Yet it is often necessary to join streaming data with traditional databases to provide more contextual information for end users and applications. The major problem we confront is joining fast-arriving stream tuples with static relation tuples stored in a slow database. We call this the Stream-Relation Join (SRJ) problem. Currently, SPEs use a naive tuple-by-tuple approach for SRJ processing, in which the SPE accesses the database for every incoming tuple. Some SPEs use a cache to avoid accessing the database for every incoming tuple, while others do not because of the stochastic nature of streaming data. In this paper, we propose a new SRJ operator that facilitates SRJ processing regardless of cache performance using two techniques: batching and out-of-order processing. The proposed operator provides an effective, generic solution to the SRJ problem, and the cost of incorporating it into different SPEs is minimal. Our experiments on a variety of synthetic and real data sets demonstrate that our operator outperforms the state-of-the-art tuple-by-tuple approach in maximizing throughput under ordering and memory constraints.
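
The contrast between tuple-by-tuple access and batching can be illustrated with a minimal sketch. This is not the operator proposed in the paper, whose design the abstract does not describe. It only shows, under simplifying assumptions, why grouping stream tuples reduces the number of database round trips. The names `SlowRelation`, `lookup_many`, `tuple_by_tuple_join`, and `batched_join` are hypothetical, the database is replaced by an in-memory dictionary that counts round trips, and out-of-order emission is not modeled (results keep arrival order).

```python
from typing import Dict, Iterable, Iterator, List, Tuple

StreamTuple = Tuple[int, str]        # (join key, payload): assumed shape
JoinResult = Tuple[int, str, str]    # (join key, payload, relation value)


class SlowRelation:
    """Hypothetical stand-in for a database table; counts round trips."""

    def __init__(self, rows: Dict[int, str]) -> None:
        self._rows = rows
        self.round_trips = 0

    def lookup_many(self, keys: Iterable[int]) -> Dict[int, str]:
        # One call models one round trip, however many keys it carries.
        self.round_trips += 1
        return {k: self._rows[k] for k in set(keys) if k in self._rows}


def tuple_by_tuple_join(
    stream: Iterable[StreamTuple], relation: SlowRelation
) -> Iterator[JoinResult]:
    """Naive approach: one database access per incoming stream tuple."""
    for key, payload in stream:
        match = relation.lookup_many([key])
        if key in match:
            yield key, payload, match[key]


def _flush(batch: List[StreamTuple], relation: SlowRelation) -> Iterator[JoinResult]:
    # A single round trip resolves every distinct key in the batch.
    matches = relation.lookup_many(key for key, _ in batch)
    for key, payload in batch:
        if key in matches:
            yield key, payload, matches[key]


def batched_join(
    stream: Iterable[StreamTuple], relation: SlowRelation, batch_size: int
) -> Iterator[JoinResult]:
    """Buffer up to batch_size tuples, then join them with one access."""
    batch: List[StreamTuple] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield from _flush(batch, relation)
            batch = []
    if batch:  # join whatever remains when the stream ends
        yield from _flush(batch, relation)
```

In this sketch both functions produce the same join results, but the batched variant issues roughly one round trip per `batch_size` tuples instead of one per tuple. The price is added latency and buffer memory, since a tuple waits until its batch fills, which is the kind of trade-off the ordering and memory constraints in the abstract refer to.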