RRPJ: result-rate based progressive relational join

  • Authors:
  • Wee Hyong Tok;Stéphane Bressan;Mong-Li Lee

  • Affiliations:
  • School of Computing, National University of Singapore;School of Computing, National University of Singapore;School of Computing, National University of Singapore

  • Venue:
  • DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Progressive join algorithms are join algorithms that produce results incrementally as input data is available. Because they are nonblocking, they are particularly suitable for online processing of data streams. Reference algorithms of this family are the symmetric hash join, the X-join and more recently, the rate-based progressive join (RPJ). While the symmetric hash join introduces the idea of a symmetric processing of the input streams but assumes sufficient main memory, the X-Join suggests that the processing can scale to very large amounts of data if main memory is regularly flushed to disk, and a reactive/cleanup phase is triggered for disk-resident data. The X-join flushing strategy is based on a simple largest-first strategy, where the largest partition is flushed to disk. The recently proposed RPJ predicts the main memory tuples or partitions that should be flushed to disk in order to maximize throughput by computing their probabilities to contribute to a result. In this paper, we discuss the limitations of RPJ and propose a novel extension, called Result Rate-based Progressive Join (RRPJ), which addresses these limitations. Instead of computing the probabilities from statistics over the input data, RRPJ directly observes the output (result) statistics. This not only yields a better performance, but also simplifies the generalization of the algorithm to non-relational data such as multidimensional data and hierarchical data. We empirically show that RRPJ is effective and efficient and outperforms the state-of-art RPJ. We also investigate the relevance and performance of an adaptive version of these algorithms using amortization parameters.