X-HYBRIDJOIN for near-real-time data warehousing

Authors:
Muhammad Asif Naeem;Gillian Dobbie;Gerald Weber
Affiliations:
Department of Computer Science, The University of Auckland, Auckland, New Zealand;Department of Computer Science, The University of Auckland, Auckland, New Zealand;Department of Computer Science, The University of Auckland, Auckland, New Zealand
Venue:
BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Year:
2011

Abstract

In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises: how much does MESHJOIN lose in terms of performance by not adapting to data skew? In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be used by non-adaptive approaches such as MESHJOIN.
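
The intuition behind exploiting skew can be illustrated with a small simulation. The sketch below is not MESHJOIN or X-HYBRIDJOIN; it is a toy model, under assumptions of our own, that only counts how many stream tuples could be matched from memory if the most frequently referenced master records were kept resident. The function names (`zipf_stream`, `disk_lookups`), the Zipf-like weights, and all sizes are hypothetical choices made for illustration and do not come from the paper.

```python
import random
from itertools import accumulate


def zipf_stream(n_keys, n_tuples, exponent=1.0, seed=0):
    """Return n_tuples join keys drawn from 0..n_keys-1.

    Key k is drawn with weight 1 / (k + 1) ** exponent, so low-numbered
    keys are the "popular products" (an assumed Zipf-like skew).
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** exponent for rank in range(n_keys)]
    cumulative = list(accumulate(weights))
    return rng.choices(range(n_keys), cum_weights=cumulative, k=n_tuples)


def disk_lookups(stream, pinned_keys):
    """Count stream tuples whose master record is NOT pinned in memory."""
    pinned = set(pinned_keys)
    return sum(1 for key in stream if key not in pinned)


# Hypothetical sizes: 10,000 master records, 50,000 stream tuples.
stream = zipf_stream(n_keys=10_000, n_tuples=50_000)

# Baseline: nothing is pinned, every tuple needs disk-resident master data.
no_cache = disk_lookups(stream, pinned_keys=[])

# Skew-aware variant: pin the 100 most frequent master records (1% of them).
hot_cache = disk_lookups(stream, pinned_keys=range(100))

print(f"tuples needing disk data, nothing pinned: {no_cache}")
print(f"tuples needing disk data, top 100 pinned: {hot_cache}")
```

Under a uniform distribution, pinning 1% of the master data would serve roughly 1% of the stream from memory. Under the skew assumed here, the share is far larger, and this gap is the kind of room for improvement the abstract refers to.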