The spread of cloud computing has increased the need to accumulate large amounts of data and to process that data at high speed. Because strict consistency is not necessarily required for the large volumes of data stored in the cloud, distributed key-value stores (KVS) based on an eventual consistency paradigm are considered suitable for storing it. Providing services such as SNS makes mining and statistical processing of these data indispensable. However, because general distributed KVS systems are not designed for data processing, the data must first be transferred to a distributed file system such as HDFS, where processing is possible, and this transfer incurs a cost. To address this issue, we propose a method that performs high-speed data processing directly on a distributed KVS. In this paper, we extend the Apache Cassandra database, a distributed KVS that handles large amounts of data, to enable data affinity-based parallel processing. The parallel data processing mechanism runs the processing locally on each data node that stores the values, and returns only the results of the processing in answer to a request. The evaluation experiments show that the proposed method is faster than the typically used Cassandra approach. In addition, even when writes are performed in the background during data processing, the processing efficiency remains appropriate for specific loads. The experimental results show that data processing can be performed while data is being written at approximately 10 MB/s when there are eight data nodes in the experimental environment.
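The data affinity idea described above can be illustrated with a minimal sketch. It is not the authors' Cassandra extension and uses no Cassandra API: the `DataNode` and `Cluster` classes, the MD5-based key placement, and the `count_matching` function are all hypothetical names chosen for illustration, and replication is ignored. The sketch only shows the general pattern of running a function on each node's local values and sending back the partial results rather than the stored data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
import hashlib


@dataclass
class DataNode:
    """Hypothetical stand-in for one data node holding a slice of the key space."""
    name: str
    store: dict = field(default_factory=dict)

    def process_locally(self, func):
        # Run the user function next to the data; only its result leaves the node.
        return func(self.store.values())


class Cluster:
    """Hypothetical coordinator; no replication, each key lives on one node."""

    def __init__(self, node_count):
        self.nodes = [DataNode(f"node{i}") for i in range(node_count)]

    def _owner(self, key):
        # Simple hash partitioning (an assumption, not Cassandra's partitioner).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def process(self, local_func, merge_func):
        # Fan the request out to every node in parallel, then merge the partials.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(
                pool.map(lambda node: node.process_locally(local_func), self.nodes)
            )
        return merge_func(partials)


def count_matching(values):
    # Example local job: count the values on this node that contain "error".
    return sum(1 for v in values if "error" in v)


cluster = Cluster(node_count=8)
for i in range(1000):
    cluster.put(f"log{i}", "error: timeout" if i % 10 == 0 else "ok")

total_errors = cluster.process(count_matching, sum)
print(total_errors)  # prints 100
```

In this sketch the coordinator receives eight integers instead of 1000 stored values, which is the source of the transfer savings the abstract refers to; the actual speedup in a real deployment depends on factors the sketch does not model.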