Web data processing on the cloud

  • Authors:
  • Shahan Khatchadourian;Mariano Consens;Jerome Simeon

  • Affiliations:
  • University of Toronto;University of Toronto;IBM Corp.

  • Venue:
  • Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
  • Year:
  • 2010

Abstract

Cloud computing is emerging as a highly scalable, fault-tolerant, and cost-effective way to process large amounts of information on the Web. Thanks in part to new data processing paradigms designed with the Cloud in mind (such as MapReduce [1], HDFS [2], and Cassandra [3]), the Cloud is quickly gaining acceptance as a viable platform for organizations that need to store, process, and publish large amounts of data. MapReduce is attractive for processing data on the Cloud largely because of its simplicity and flexibility. Implementations of MapReduce usually include a simple API used to describe which part of the processing is done in parallel (the Map phase) and which part is done after grouping data on a single machine (the Reduce phase). MapReduce does not rely on a pre-existing data model, which makes it possible to process any kind of information regardless of how it is modeled. Cloud applications have notably been used to perform off-line analytical processing such as analyzing web request logs, computing user recommendations, and understanding scenes in images [4].
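
The Map and Reduce phases described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' system or the API of any particular MapReduce implementation: the function names (`map_phase`, `reduce_phase`, `run_mapreduce`) and the three-field log format are assumptions made for illustration, and the "grouping" step that a real framework performs across machines is simulated here with an in-memory dictionary. The example counts requests per path in a handful of web request log lines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (path, 1) for one log line.

    Assumes a hypothetical "METHOD PATH STATUS" line format.
    Each record is handled independently, so this step could run in parallel.
    """
    parts = record.split()
    if len(parts) >= 2:
        yield parts[1], 1


def reduce_phase(key, values):
    """Reduce: combine all values that were grouped under one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then run reducer per key.

    The grouping is done in memory here; a real framework would shuffle
    the intermediate pairs between machines instead.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample input, not data from the paper.
logs = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /index.html 304",
]
counts = run_mapreduce(logs, map_phase, reduce_phase)
```

Note that neither phase depends on a schema: the mapper only assumes whatever structure it chooses to parse out of each record, which is the sense in which the model is independent of a pre-existing data model.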