Semi-Streamed Index Join for near-real time execution of ETL transformations

Authors:
Mihaela A. Bornea;Antonios Deligiannakis;Yannis Kotidis;Vasilis Vassalos
Affiliations:
Athens University of Economics and Business, Greece;Technical University of Crete, Greece;Athens University of Economics and Business, Greece;Athens University of Economics and Business, Greece
Venue:
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Year:
2011

Abstract

Active data warehouses have emerged as a new business intelligence paradigm in which data in the integrated repository is refreshed in near real time. This shift in practice achieves higher consistency between the stored information and the latest updates, which in turn crucially influences the output of decision-making processes. In this paper we focus on the changes required in the implementation of Extract-Transform-Load (ETL) operations, which now need to be executed in an online fashion. In particular, ETL transformations frequently include a join between an incoming stream of updates and a disk-resident table of historical data or metadata. In this context we propose a novel Semi-Streaming Index Join (SSIJ) algorithm that maximizes the throughput of the join by buffering stream tuples and then judiciously selecting how best to amortize expensive disk seeks for blocks of the stored relation over a large number of stream tuples. The relation blocks required for joining with the stream are loaded from disk based on an optimal plan. To maximize the utilization of the available memory for performing the join, our technique incorporates a simple but effective cache replacement policy for managing the retrieved blocks of the relation. Moreover, SSIJ is able to adapt to changing characteristics of the stream (i.e., arrival rate and data distribution) by dynamically adjusting the memory allocated between the cached relation blocks and the stream. Our experiments with a variety of synthetic and real data sets demonstrate that SSIJ consistently outperforms the state-of-the-art algorithm in terms of the maximum sustainable throughput of the join, while also being able to accommodate deadlines on stream tuple processing.
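
The core idea in the abstract, buffering stream tuples so that one disk read of a relation block serves many of them, can be illustrated with a small toy. The sketch below is not the SSIJ algorithm from the paper: the block-selection rule (most waiting tuples first), the LRU cache, and every name in it (ToyRelation, BatchedIndexJoin, BLOCK_SIZE, block_of) are assumptions made purely for illustration, and it does not model the optimal loading plan, the adaptive memory split, or tuple deadlines.

```python
from collections import OrderedDict, defaultdict

BLOCK_SIZE = 4  # hypothetical: number of relation keys stored per disk block


class ToyRelation:
    """Stand-in for a disk-resident, key-ordered relation; counts block reads."""

    def __init__(self, rows):
        self.rows = dict(rows)      # join key -> payload
        self.block_reads = 0        # each read models one expensive disk seek

    def block_of(self, key):
        # Hypothetical index lookup: maps an integer join key to its block id.
        return key // BLOCK_SIZE

    def read_block(self, block_id):
        self.block_reads += 1
        lo = block_id * BLOCK_SIZE
        return {k: self.rows[k] for k in range(lo, lo + BLOCK_SIZE) if k in self.rows}


class BatchedIndexJoin:
    """Buffers stream tuples per block so one block read serves many tuples."""

    def __init__(self, relation, cache_blocks=2):
        self.relation = relation
        self.cache_blocks = cache_blocks
        self.cache = OrderedDict()          # block_id -> rows, least recent first
        self.pending = defaultdict(list)    # block_id -> buffered stream tuples

    def push(self, key, value):
        # Buffer the stream tuple under the block that would hold its match.
        self.pending[self.relation.block_of(key)].append((key, value))

    def _get_block(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)    # mark as most recently used
            return self.cache[block_id]
        rows = self.relation.read_block(block_id)
        self.cache[block_id] = rows
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)      # evict least recently used block
        return rows

    def flush(self):
        # Visit blocks with the most waiting tuples first (an assumed heuristic).
        results = []
        order = sorted(self.pending, key=lambda b: -len(self.pending[b]))
        for block_id in order:
            rows = self._get_block(block_id)
            for key, value in self.pending[block_id]:
                if key in rows:
                    results.append((key, value, rows[key]))
        self.pending.clear()
        return results
```

With a relation of twelve keys and nine buffered stream tuples that fall into three blocks, a single flush performs three block reads, whereas probing the relation once per tuple without caching would perform nine. A tuple arriving later for a block that is still cached triggers no further read.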