Optimized data placement for column-oriented data store in the distributed environment

Authors:
Minqi Zhou;Chen Xu
Affiliations:
Massive Computing Institute, East China Normal University, Shanghai, China;Massive Computing Institute, East China Normal University, Shanghai, China
Venue:
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Year:
2011

Citing 8
Cited 0

Caching in the Sprite network file system

ACM Transactions on Computer Systems (TOCS)
Disconnected operation in the Coda File System

ACM Transactions on Computer Systems (TOCS)
A decomposition storage model

SIGMOD '85 Proceedings of the 1985 ACM SIGMOD international conference on Management of data
A Conversation with Jim Gray

Queue - Storage
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
C-store: a column-oriented DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Bigtable: A Distributed Storage System for Structured Data

ACM Transactions on Computer Systems (TOCS)

Abstract

Column-oriented data storage has become a buzzword for its high efficiency in massive data access, its high compression ratio on individual columns, and so on. However, these initial observations turn out not to be trivially true. Compared with the improvements in other computer components over the past four decades, the seek time and bandwidth of current hard disk drives (HDDs) are increasingly the bottleneck for massive data processing. In this paper, we provide a novel data placement strategy for massive data analysis (i.e., read-optimized) based on Gray code, which greatly increases the ratio of sequential access for diverse query evaluations (e.g., range queries, partial match range queries, and aggregation queries). A centralized/distributed structured index is employed in popularly deployed distributed file systems (e.g., GFS), which achieves convenient management, efficient accessibility, and high extendibility. We provide a detailed theoretical analysis of index extendibility, sequential access improvement, and storage capacity usage for the proposed data placement strategies, along with specific algorithms. Our extensive experimental studies confirm the efficiency and effectiveness of the proposed data placement methods.
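
The abstract does not spell out the placement algorithm itself, so the following is only a minimal sketch of the underlying building block, the standard binary-reflected Gray code, and not the authors' method. The function names (`gray_encode`, `gray_decode`, `placement_order`) and the idea of storing the cell with key `gray_encode(i)` at position `i` are assumptions made for illustration. The property it demonstrates is that keys at neighbouring positions differ in exactly one bit.

```python
def gray_encode(n: int) -> int:
    """Return the binary-reflected Gray codeword for rank n (n >= 0)."""
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Inverse of gray_encode: recover the rank from a Gray codeword."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def placement_order(num_bits: int) -> list[int]:
    """Hypothetical layout: position i on disk holds the cell keyed gray_encode(i)."""
    return [gray_encode(i) for i in range(1 << num_bits)]


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


order = placement_order(3)
# Keys in storage order, written as 3-bit strings.
print([format(code, "03b") for code in order])
# Every pair of neighbouring positions differs in exactly one bit.
print(all(hamming(a, b) == 1 for a, b in zip(order, order[1:])))
```

Under this assumed layout, `gray_decode` maps a key back to its storage position, so a lookup for a given key can be turned into an offset without scanning.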