Efficiently querying archived data using Hadoop

Authors:
Rajeev Gupta; Himanshu Gupta; Ullas Nambiar; Mukesh Mohania
Affiliations:
IBM Research, Delhi, India; IBM Research, Delhi, India; IBM Research, Delhi, India; IBM Research, Delhi, India
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

The need to analyze structured data for business intelligence applications such as customer churn analysis, social network analysis, and telecom network monitoring is well known. However, the size to which such data is expected to grow will make solutions built around data warehouses hard to scale. As data sizes grow, data is moved from the warehouse to archives more frequently. Current file-based archive models make the archived data unusable for any type of insight extraction. In this paper, we present an active archival solution for data warehouses that uses the Hadoop Distributed File System (HDFS) to store the data in an always-available and cost-effective manner. We investigate various schemes for storing structured data within HDFS, and empirical evaluations show that a combination of the Universal scheme model and a column store is best suited for the active archival solution.
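
As a rough illustration of the combination named in the abstract, the following Python sketch flattens a hypothetical fact table and dimension table into a single wide (universal) relation and then lays it out column by column, so that a query reads only the columns it needs. The table names, sample data, and helper functions are assumptions made for illustration. The sketch uses in-memory lists rather than HDFS files and is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical warehouse tables (illustrative data, not from the paper).
customers = {
    1: {"name": "Asha", "city": "Delhi"},
    2: {"name": "Ravi", "city": "Pune"},
}
calls = [
    {"call_id": 10, "customer_id": 1, "minutes": 12},
    {"call_id": 11, "customer_id": 2, "minutes": 3},
    {"call_id": 12, "customer_id": 1, "minutes": 7},
]


def to_universal(facts, dims, key):
    """Denormalize: merge each fact row with its dimension row into one wide row."""
    return [{**row, **dims[row[key]]} for row in facts]


def to_columns(rows):
    """Pivot wide rows into one value list per column (a column-store layout)."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def total_by(columns, group_col, value_col):
    """Aggregate while reading only the two columns the query needs."""
    totals = defaultdict(int)
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] += value
    return dict(totals)


universal_rows = to_universal(calls, customers, "customer_id")
archive = to_columns(universal_rows)
minutes_by_city = total_by(archive, "city", "minutes")
```

In this sketch the join is paid once at archive time, and the aggregate over `city` and `minutes` never touches the `name` or `call_id` columns. In an HDFS-backed archive each column would presumably live in its own file or block group, which is an assumption here rather than a detail given in the abstract.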