Cluster computing for web-scale data processing

Authors:
Aaron Kimball;Sierra Michels-Slettvet;Christophe Bisciglia
Affiliations:
University of Washington, Seattle, WA, USA;Department of Computer Science and Engineering, University of Washington, WA, USA;Google, Inc., Mountain View, CA, USA
Venue:
Proceedings of the 39th SIGCSE technical symposium on Computer science education
Year:
2008

Abstract

In this paper we present the design of a modern course in cluster computing and large-scale data processing. It differs from previously published designs in its focus on processing very large data sets and in its use of Hadoop, an open source Java-based implementation of MapReduce and the Google File System, as the platform for programming exercises. Hadoop proved to be a key element in successfully implementing structured lab activities and independent design projects. Through this course, offered at the University of Washington in 2007, we imparted new skills to our students, improving their ability to design systems capable of solving web-scale problems.