Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce

Authors: Jimmy Lin
Affiliations: University of Maryland
Venue: EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year: 2008

Abstract

This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in commercial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. I discuss two barriers contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google's MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within the reach of many academic research groups. This paper illustrates these points with a case study in building word cooccurrence matrices from large corpora. I conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.
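
To make the case study concrete, the following is a minimal single-process sketch of how word co-occurrence counting can be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the paper's implementation: it does not use Hadoop, the function names (`map_cooccurrences`, `reduce_counts`, `cooccurrence_matrix`) and the `window` parameter are hypothetical, and the choice to emit one key per word pair is an assumption that the abstract does not specify.

```python
from collections import defaultdict
from itertools import chain


def map_cooccurrences(tokens, window=2):
    """Map step: emit ((word, neighbor), 1) for every neighbor
    within `window` positions of each word in one document."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (word, tokens[j]), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (word, neighbor) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def cooccurrence_matrix(documents, window=2):
    """Run the map step over every document, then reduce the combined output.
    Tokenization (lowercase, whitespace split) is a simplifying assumption."""
    emitted = chain.from_iterable(
        map_cooccurrences(doc.lower().split(), window) for doc in documents
    )
    return reduce_counts(emitted)


if __name__ == "__main__":
    matrix = cooccurrence_matrix(["a b c", "a b"], window=1)
    for (word, neighbor), count in sorted(matrix.items()):
        print(word, neighbor, count)
```

In a real MapReduce job the framework would group the emitted keys and distribute the summation across reducers; here the grouping is simulated with an in-memory dictionary, which is only practical for small corpora.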