TJJE: An efficient algorithm for top-k join on massive data

  • Authors:
  • Xixian Han;Jianzhong Li;Jinbao Wang;Donghua Yang

  • Affiliations:
  • School of Computer Science and Technology, Harbin Institute of Technology, China;School of Computer Science and Technology, Harbin Institute of Technology, China;School of Computer Science and Technology, Harbin Institute of Technology, China;The Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, China

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2013

Abstract

In many applications, top-k join is an important operation that returns the k most important join tuples from a potentially huge answer space according to a given ranking function. PBRJ is an algorithm template that generalizes previous top-k join algorithms. Our analysis shows that, on massive data, PBRJ needs to maintain a large number of candidate tuples. Based on this analysis, this paper proposes TJJE, a novel top-k join algorithm suited to massive data. Using pre-computed information, TJJE first estimates an upper bound on the scan depth of each joined table. It then determines the file that contains the join positional index pairs of the top-k join results. A novel algorithm is proposed to retrieve the required join tuples with a single sequential and selective scan of the joined tables. Finally, the top-k join results are obtained with a single scan of the retrieved join tuples. The paper presents a correctness proof and cost analysis of TJJE. Extensive experiments show that TJJE maintains up to three orders of magnitude fewer candidate tuples and achieves up to one order of magnitude speedup compared with PBRJ.