Efficiently querying archived data using Hadoop

  • Authors:
  • Rajeev Gupta;Himanshu Gupta;Ullas Nambiar;Mukesh Mohania

  • Affiliations:
  • IBM Research, Delhi, India;IBM Research, Delhi, India;IBM Research, Delhi, India;IBM Research, Delhi, India

  • Venue:
  • CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
  • Year:
  • 2010

Abstract

The need to analyze structured data for business intelligence applications such as customer churn analysis, social network analysis, and telecom network monitoring is well known. However, the size to which such data is expected to grow will make solutions built around data warehouses hard to scale. As data sizes grow, data is moved from the warehouse to archives more frequently. Current file-based archive models make the archived data unusable for any type of insight extraction. In this paper, we present an active archival solution for data warehouses that uses the Hadoop Distributed File System (HDFS) to store the data in an always-available and cost-effective manner. We investigate various structured data storage schemes within HDFS, and empirical evaluations show that a combination of the Universal scheme model and a column store is best suited for the active archival solution.
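
The abstract does not define the two storage ideas it combines, so the following is only a rough illustration under stated assumptions. It assumes that the "Universal scheme" means a single wide relation obtained by joining fact rows with their dimension rows, and that "column store" means keeping one value list per column. The tables, field names, and helper functions are hypothetical and are not taken from the paper; plain in-memory Python stands in for HDFS files.

```python
from collections import defaultdict

# Hypothetical toy warehouse: one fact table and one dimension table.
calls = [
    {"call_id": 1, "cust_id": 10, "duration_s": 120},
    {"call_id": 2, "cust_id": 11, "duration_s": 45},
    {"call_id": 3, "cust_id": 10, "duration_s": 300},
]
customers = {
    10: {"name": "Asha", "city": "Delhi"},
    11: {"name": "Ravi", "city": "Pune"},
}


def to_universal(facts, dim, key):
    """Join each fact row with its dimension row into one wide record.

    Assumption: this denormalized relation is what "Universal scheme" means.
    """
    return [{**row, **dim[row[key]]} for row in facts]


def to_columns(rows):
    """Pivot row records into one list per column (a column-store layout)."""
    cols = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            cols[name].append(value)
    return dict(cols)


def total_duration_for_city(cols, city):
    """Aggregate using only the two columns the query needs, with no join."""
    return sum(
        duration
        for duration, c in zip(cols["duration_s"], cols["city"])
        if c == city
    )


universal = to_universal(calls, customers, "cust_id")
columns = to_columns(universal)
delhi_total = total_duration_for_city(columns, "Delhi")
```

In this sketch the query reads only the `duration_s` and `city` columns and performs no join at query time, which is one plausible reason such a combination could suit archived data. Writing each column list to its own file in HDFS is omitted here.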