Implementation of data affinity-based distributed parallel processing on a distributed key value store

  • Authors:
  • Naoko Hishinuma; Atsuko Takefusa; Hidemoto Nakada; Masato Oguchi

  • Affiliations:
  • Ochanomizu University, Otsuka, Bunkyo-ku, Tokyo, Japan; National Institute of Advanced Industrial Science and Technology (AIST), Umezono, Tsukuba, Ibaraki, Japan; National Institute of Advanced Industrial Science and Technology (AIST), Umezono, Tsukuba, Ibaraki, Japan; Ochanomizu University, Otsuka, Bunkyo-ku, Tokyo, Japan

  • Venue:
  • Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
  • Year:
  • 2014

Abstract

The spread of cloud computing has increased the need to accumulate large amounts of data and to process them at high speed. Because strict consistency is not necessarily required for the large amounts of data stored in the cloud, a distributed key-value store (KVS) based on an eventual consistency paradigm is considered suitable for storing them. To provide services such as SNS, mining and statistical processing of these data are indispensable. However, because general distributed KVS systems are not designed for processing, the data must first be transferred to a distributed file system such as HDFS, which supports data processing, and this transfer incurs a cost. To address this issue, we propose a method that performs high-speed data processing directly on a distributed KVS. In this paper, we extend the Apache Cassandra database, a distributed KVS that handles large amounts of data, to enable data affinity-based parallel processing. The parallel data processing mechanism runs the processing locally at each data node that stores the values and returns only the results of the processing in response to a request. Evaluation experiments show that the proposed method is faster than the typically used Cassandra approach. In addition, even when writes are performed in the background while the data are being processed, the processing efficiency remains adequate for specific loads. The experimental results show that, with eight data nodes in the experimental environment, data processing can be performed while data are being written at approximately 10 Mbyte/sec.
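
The idea of running the processing where the values are stored and returning only the results can be illustrated with a small, self-contained sketch. This is not the authors' Cassandra extension: the names `DataNode`, `ToyCluster`, `fetch_then_process`, and `process_in_place` are hypothetical, the CRC32-based key placement is only a stand-in for a real partitioner, and the "fetch everything to the client" baseline is an assumption about what a conventional approach might look like. The sketch simply counts how many items cross the node boundary in each case.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DataNode:
    """Toy stand-in for one storage node (hypothetical, not a Cassandra class)."""
    name: str
    store: Dict[str, List[int]] = field(default_factory=dict)

    def process_locally(self, fn: Callable[[List[int]], int]) -> int:
        # Apply fn to every value held by this node; only the result leaves the node.
        local_values = [v for values in self.store.values() for v in values]
        return fn(local_values)


class ToyCluster:
    """In-memory model of a key-value cluster, used only for illustration."""

    def __init__(self, node_count: int) -> None:
        self.nodes = [DataNode(f"node{i}") for i in range(node_count)]

    def _owner(self, key: str) -> DataNode:
        # Assumption: a stable hash modulo the node count decides placement.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key: str, values: List[int]) -> None:
        self._owner(key).store[key] = list(values)

    def fetch_then_process(self, fn: Callable[[List[int]], int]) -> Tuple[int, int]:
        # Assumed baseline: move every stored value to the client, then process.
        pulled = [v for node in self.nodes
                  for values in node.store.values() for v in values]
        return fn(pulled), len(pulled)

    def process_in_place(self, local_fn: Callable[[List[int]], int],
                         merge_fn: Callable[[List[int]], int]) -> Tuple[int, int]:
        # Data-affinity style: each node processes its own values,
        # and only one partial result per node is transferred.
        partials = [node.process_locally(local_fn) for node in self.nodes]
        return merge_fn(partials), len(partials)


cluster = ToyCluster(node_count=8)  # eight nodes, as in the reported experiment
for i in range(100):
    cluster.put(f"user{i}", [i, i + 1, i + 2])

baseline_total, baseline_moved = cluster.fetch_then_process(sum)
local_total, local_moved = cluster.process_in_place(sum, sum)
print(baseline_total, baseline_moved)  # same total, 300 items moved
print(local_total, local_moved)        # same total, 8 partial results moved
```

In this toy model both strategies compute the same sum, but the in-place variant transfers one partial result per node instead of every stored value. The sketch says nothing about real network cost, replication, or concurrent writes, which are what the paper's experiments measure.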