Distributed parallel architecture for storing and processing large datasets

  • Authors:
  • Catalin Boja
  • Adrian Pocovnicu

  • Affiliations:
  • Department of Economic Informatics and Cybernetics, Bucharest Academy of Economic Studies, Bucharest, Romania (both authors)

  • Venue:
  • SEPADS'12/EDUCATION'12 Proceedings of the 11th WSEAS international conference on Software Engineering, Parallel and Distributed Systems, and proceedings of the 9th WSEAS international conference on Engineering Education
  • Year:
  • 2012

Abstract

We live in the data age: storage technologies, both hardware and software, have evolved to the point at which it is very cheap to store large volumes of data, structured and unstructured. The growing popularity of social media has contributed to the accumulation of large, mostly unstructured data volumes which, when analyzed, can yield valuable insight. Extracting meaningful, useful, and accurate information from very large data sets in a timely manner is a complex task that requires careful selection of the right hardware, software, and data model. This paper analyzes the problem of storing, processing, and retrieving meaningful insight from petabytes of data. It surveys current distributed and parallel data processing technologies and, based on them, proposes an architecture that can be used to solve the analyzed problem.