Optimized data placement for column-oriented data store in the distributed environment

  • Authors:
  • Minqi Zhou;Chen Xu

  • Affiliations:
  • Massive Computing Institute, East China Normal University, Shanghai, China; Massive Computing Institute, East China Normal University, Shanghai, China

  • Venue:
  • DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
  • Year:
  • 2011

Abstract

Column-oriented data storage has become a buzzword for its high efficiency in massive data access, its high compression ratios on individual columns, and so on. However, these initial observations turn out not to be trivially true. The seek time and bandwidth of current hard disk drives (HDDs) are increasingly the bottleneck for massive data processing, compared with the improvements in other computer components over the past four decades. In this paper, we provide a novel data placement strategy for massive data analysis (i.e., read-optimized) based on Gray code, which greatly increases the ratio of sequential access for diverse query evaluations (e.g., range queries, partial match range queries, and aggregation queries). A centralized/distributed structured index is employed in popularly deployed distributed file systems (e.g., GFS), which achieves convenient management, efficient accessibility, and high extendibility. Detailed theoretical analyses of index extendibility, sequential access improvement, and storage capacity usage under the proposed data placement strategies are provided, along with specific algorithms. Our extensive experimental studies confirm the efficiency and effectiveness of the proposed data placement methods.
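
The abstract does not say how the Gray code ordering is applied to placement, so the following is background only, not the authors' algorithm. It is a minimal sketch of the standard binary-reflected Gray code, showing the property that plausibly makes it attractive for placement: consecutive positions in the ordering differ in exactly one bit. The function names (`gray_encode`, `gray_decode`, `hamming_distance`) are hypothetical and chosen for illustration.

```python
def gray_encode(n: int) -> int:
    """Return the binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Invert gray_encode by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Gray codes for the positions 0..7 (3 bits).
# Illustration only: how such codes map to disk blocks is an assumption
# left open here, since the abstract does not describe it.
codes = [gray_encode(i) for i in range(8)]

# Each pair of consecutive codes differs in exactly one bit.
adjacent_distances = [
    hamming_distance(codes[i], codes[i + 1]) for i in range(len(codes) - 1)
]

if __name__ == "__main__":
    print([format(c, "03b") for c in codes])
    print(adjacent_distances)
```

Whether the paper orders attribute values, partitions, or blocks by such a code cannot be determined from the abstract. The sketch demonstrates only the single-bit adjacency property.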