On clusterization of "big data" streams

  • Authors:
  • Simon Berkovich; Duoduo Liao

  • Affiliations:
  • George Washington University, Washington, DC; George Washington University, Washington, DC

  • Venue:
  • Proceedings of the 3rd International Conference on Computing for Geospatial Research and Applications
  • Year:
  • 2012

Abstract

Current technology provides powerful facilities for operating on extremely large amounts of data. These facilities are expanding thanks to the capabilities of "Cloud Computing." This development gives rise to the "Big Data" concept, which poses specific engineering and organizational challenges. Big data refers to the rising flood of digital data from many sources, including sensors, digitizers, scanners, software-based modeling, mobile phones, the internet, videos, e-mails, and social network communications. The data may be text, geometry, images, video, sound, or a combination of these. Much of this data is directly or indirectly related to geospatial information. In this paper, we propose enhancing the available information processing resources with a novel software/hardware technique for on-the-fly clusterization of amorphous data from diverse sources. The presented approach is based on the previously developed FuzzyFind Dictionary construction, which utilizes the error-correcting Golay code. Realizing this technique requires processing intensive continuous data streams, which can be implemented effectively using multi-core pipelining with forced interrupts. The objective of this paper is to bring forward a new, simple, and efficacious tool for one of the most demanding operations of the "Big Data" methodology: clustering diverse information items in a data stream mode. Improving our ability to extract knowledge and insights from large and complex collections of digital data promises to solve some of the Nation's most pressing challenges. Furthermore, the paper reveals a parallel between the computational model integrating "Big Data" streams and the organization of information processing in the brain. The uncertainties associated with the considered method of clusterization are moderated by the idea of bounded rationality, an approach that does not require complete, exact knowledge for sensible decision-making.
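
The role of the Golay code in the abstract above can be illustrated with a small sketch. It is not the authors' FuzzyFind Dictionary, whose details the abstract does not give. It is a simplified stand-in that relies only on a known property of the binary (23,12) Golay code: every 23-bit vector lies within Hamming distance 3 of exactly one codeword. The sketch assumes each stream item has already been reduced to a 23-bit feature vector (an assumption, not stated in the abstract) and uses the 12 message bits of the nearest codeword as a cluster key. All function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# One standard generator polynomial of the binary (23,12) Golay code:
# x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1
GOLAY_POLY = 0xC75
N = 23  # codeword length in bits


def poly_mod(word: int) -> int:
    """Remainder of a 23-bit word, read as a GF(2) polynomial, modulo GOLAY_POLY."""
    for shift in range(N - 12, -1, -1):
        if word & (1 << (shift + 11)):
            word ^= GOLAY_POLY << shift
    return word


def encode(message: int) -> int:
    """Systematic encoding: 12 message bits followed by 11 check bits."""
    shifted = message << 11
    return shifted | poly_mod(shifted)


# The code is perfect, so the 2048 error patterns of weight <= 3
# map one-to-one onto the 2048 possible 11-bit syndromes.
SYNDROME_TO_ERROR = {}
for weight in range(4):
    for positions in combinations(range(N), weight):
        error = sum(1 << p for p in positions)
        SYNDROME_TO_ERROR[poly_mod(error)] = error


def cluster_key(vector: int) -> int:
    """Hypothetical cluster key: message bits of the nearest Golay codeword."""
    error = SYNDROME_TO_ERROR[poly_mod(vector)]
    return (vector ^ error) >> 11


def cluster_stream(vectors):
    """Group 23-bit feature vectors one at a time, as they arrive."""
    clusters = defaultdict(list)
    for vector in vectors:
        clusters[cluster_key(vector)].append(vector)
    return clusters
```

Under these assumptions, two feature vectors that differ from the same codeword in at most three bit positions receive the same key, so each item is assigned with a single table lookup and no pairwise comparisons. Vectors that are close to each other but fall near different codewords are separated by this simplified scheme; handling such cases is the kind of problem a fuzzy dictionary construction would need to address.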