Exploring large-data issues in the curriculum: a case study with MapReduce

Authors: Jimmy Lin
Affiliations: University of Maryland, College Park
Venue: TeachCL '08 Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics
Year: 2008

References (6)

The Google file system
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles

Scaling to very very large corpora for natural language disambiguation
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

MapReduce: simplified data processing on large clusters
OSDI '04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Cluster computing for web-scale data processing
Proceedings of the 39th SIGCSE technical symposium on Computer science education

Pairwise document similarity in large collections with MapReduce
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers

Fast, easy, and cheap: construction of statistical machine translation models with MapReduce
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation

Cited by (4)

Data-intensive text processing with MapReduce
NAACL-Tutorials '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

Fast, easy, and cheap: construction of statistical machine translation models with MapReduce
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation

Max-cover in map-reduce
Proceedings of the 19th international conference on World wide web

Prototyping an online wetland ecosystem services model using open model sharing standards
Environmental Modelling & Software

Abstract

This paper describes the design of a pilot research and educational effort at the University of Maryland centered on technologies for tackling Web-scale problems. As part of a "cloud computing" initiative led by Google and IBM, students and researchers are given access to a computer cluster running Hadoop, an open-source Java implementation of Google's MapReduce framework. This technology gives students an opportunity to explore large-data issues in a course organized around teams of graduate and undergraduate students who tackle open research problems in human language technologies. The design represents one attempt to bridge traditional instruction with real-world, large-data research challenges.