Distributed scalable collaborative filtering algorithm

  • Authors:
  • Ankur Narang; Abhinav Srivastava; Naga Praveen Kumar Katta

  • Affiliations:
  • IBM India Research Laboratory, New Delhi; IBM India Research Laboratory, New Delhi; IBM India Research Laboratory, New Delhi

  • Venue:
  • Euro-Par'11: Proceedings of the 17th International Conference on Parallel Processing - Volume Part I
  • Year:
  • 2011

Abstract

Collaborative filtering (CF) based recommender systems have gained wide popularity at Internet companies such as Amazon, Netflix, and Google News. These systems automatically predict the interests of a user by drawing inferences from information about like-minded users. Real-time CF on massive, highly sparse datasets that also achieves high prediction accuracy is a computationally challenging problem. In this paper, we present a novel design for a soft real-time (less than 10 s) distributed co-clustering-based collaborative filtering algorithm. Our distributed algorithm has been optimized for multi-core cluster architectures using pipelined parallelism, computation-communication overlap, and communication optimizations. A theoretical analysis of the parallel time complexity of our algorithm proves the efficacy of our approach. Using the Netflix dataset (100M ratings), we demonstrate the performance and scalability of our algorithm on a 1024-node Blue Gene/P system. Our distributed algorithm (implemented using OpenMP with MPI) delivered a training time of around 6 s on the full Netflix dataset and a prediction time of 2.5 s on 1.4M ratings (1.78 µs per rating prediction). Our training time is around 20× (more than one order of magnitude) better than the best known parallel training time, while maintaining high accuracy (0.87±0.02 RMSE). To the best of our knowledge, this is the best parallel performance reported for collaborative filtering on the Netflix data at such high accuracy, and also the first such implementation on a multi-core cluster architecture such as Blue Gene/P.
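
The two headline metrics in the abstract, RMSE and per-rating prediction time, can be checked with a few lines of arithmetic. The following is a minimal Python sketch for illustration only; the function names `rmse` and `per_rating_latency_us` and the sample ratings are assumptions made here, not part of the paper's OpenMP/MPI implementation.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


def per_rating_latency_us(total_seconds, num_ratings):
    """Average prediction time per rating, in microseconds."""
    return total_seconds / num_ratings * 1e6


# Figures taken from the abstract: 2.5 s to predict 1.4M ratings.
latency = per_rating_latency_us(2.5, 1_400_000)
print(f"{latency:.3f} microseconds per rating")

# Hypothetical ratings, used only to show how the accuracy metric is computed.
example_error = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
print(f"RMSE on the toy data: {example_error:.3f}")
```

Dividing 2.5 s by 1.4M ratings gives roughly 1.786 µs per prediction, consistent with the 1.78 µs figure quoted in the abstract.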