Teaching large scale data processing: the five-week course and two years' experiences

  • Authors:
  • Kang Chen;Yubing Yin;Weimin Zheng

  • Affiliations:
  • Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China

  • Venue:
  • SCE '08: Proceedings of the First ACM Summit on Computing Education in China
  • Year:
  • 2008

Abstract

We have set up a new course on large-scale data processing using clusters. It introduces the concepts and design of distributed systems, covering newly developed ideas such as the Google File System and the MapReduce programming framework for processing large-scale data sets. Students gain practical experience with distributed programming technologies through several small labs and one large multi-week final project. Labs and projects are completed using Hadoop, an open-source implementation of Google's distributed file system and MapReduce programming model. We have taught this class, named "Mass Data Processing Technology on Large Scale Clusters," for two years. This paper describes the design and delivery of the course, as well as the experiences and lessons learned.
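To give a sense of the programming model the course's labs are built around, here is a minimal word-count sketch of the MapReduce pattern in plain Python. This is an illustration of the concept only, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for clarity.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs, analogous to a Hadoop Mapper."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one word, analogous to a Hadoop Reducer."""
    return (key, sum(values))

def word_count(documents):
    """Run the full map -> shuffle -> reduce pipeline over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In Hadoop itself, the shuffle step is performed by the framework between the map and reduce stages; students implement only the mapper and reducer.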