This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in commercial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. I discuss two barriers contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google's MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within the reach of many academic research groups. This paper illustrates these points with a case study in building word cooccurrence matrices from large corpora. I conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.
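To make the case study concrete, the following is a minimal single-process sketch of how word cooccurrence counting might be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the algorithm, so the per-pair formulation, the whitespace tokenizer, the `window` parameter, and all function names here are hypothetical. It also does not call Hadoop; the `shuffle` function only imitates the grouping that a MapReduce framework would perform between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_cooccurrences(sentence, window=2):
    """Map step (assumed formulation): emit ((word, neighbor), 1) for
    every neighbor within `window` tokens of each word."""
    tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield (word, tokens[j]), 1


def shuffle(pairs):
    """Stand-in for the framework's shuffle phase: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one (word, neighbor) key."""
    return key, sum(values)


def cooccurrence_matrix(corpus, window=2):
    """Run map, shuffle, and reduce in-process over an iterable of sentences.
    Returns a sparse matrix as a dict keyed by (word, neighbor)."""
    mapped = chain.from_iterable(map_cooccurrences(s, window) for s in corpus)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    matrix = cooccurrence_matrix(["a b c", "a b"], window=1)
    for pair, count in sorted(matrix.items()):
        print(pair, count)
```

The point of the sketch is the division of labor the abstract describes: the developer writes only the map and reduce functions, while partitioning, grouping, and scheduling are left to the framework. On a real cluster the in-memory `shuffle` would be replaced by the framework's own distributed sort and grouping.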