A distributed look-up architecture for text mining applications using MapReduce

Authors:
Atilla Soner Balkir; Ian Foster; Andrey Rzhetsky
Affiliations:
University of Chicago, Chicago, IL; University of Chicago, Chicago, IL; University of Chicago, Chicago, IL
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

We study text analysis algorithms that use global optimization methods to compute local characteristics that are consistent with properties of the entire corpus rather than computed locally based on exogenous parameters. In the iterative implementations that we consider, each step both reads and updates a database of parameter values. Motivated by a need for rapid analysis of large corpora, we have developed methods for efficient access to such databases on parallel computers. These methods combine Bloom filters, in-memory caches, and an HBase cluster to reduce communication costs greatly relative to simpler approaches that either fully distribute or fully replicate the database. We also describe how this method can be incorporated into the MapReduce programming model, and illustrate its use within phrase segmentation programs. Our design can achieve considerable run time, latency and storage space improvements relative to other methods. In one phrase segmentation application, we improve performance by a factor of six relative to an HBase-based implementation.
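
The layered look-up described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (BloomFilter, TieredLookup), the LRU eviction policy, the filter sizing, and the use of a plain Python dict as a stand-in for the HBase table are all assumptions made for illustration. The idea shown is only the general one: a Bloom filter rules out keys that are definitely absent, an in-memory cache answers repeated queries, and only the remaining requests reach the remote store.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Force an odd step so the probes differ when num_bits is a power of two.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class TieredLookup:
    """Bloom filter, then LRU cache, then remote store (hypothetical design)."""

    def __init__(self, remote_store, cache_size=1024):
        self.remote = remote_store  # assumption: a dict standing in for HBase
        self.bloom = BloomFilter()
        for key in remote_store:
            self.bloom.add(key)
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.remote_reads = 0  # counts requests that reach the remote store

    def get(self, key, default=0):
        if key not in self.bloom:
            return default  # definitely absent: no cache or remote access
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.remote_reads += 1
        value = self.remote.get(key, default)  # Bloom false positives land here
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used entry
        return value


if __name__ == "__main__":
    phrase_counts = {"new york": 12, "machine learning": 7}  # made-up values
    lookup = TieredLookup(phrase_counts)
    print(lookup.get("new york"), lookup.get("new york"), lookup.remote_reads)
```

In this sketch a repeated query for the same phrase reaches the remote store only once, and a query for a phrase that was never stored is normally answered by the filter alone. A Bloom filter can return false positives, so an absent key may occasionally cost one remote read, but it never yields a wrong value.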