A distributed look-up architecture for text mining applications using MapReduce

Authors:
Atilla Soner Balkir;Ian Foster;Andrey Rzhetsky
Affiliations:
University of Chicago, Chicago, IL, USA;University of Chicago, Chicago, IL, USA;Computation Institute, Chicago, IL, USA
Venue:
Proceedings of the 20th international symposium on High performance distributed computing
Year:
2011

Abstract

We study text analysis algorithms that use global optimization to compute local characteristics consistent with properties of the entire corpus, rather than computing them locally from exogenous parameters. In the iterative implementations that we consider, each step both reads and updates a database of parameter values. Motivated by a need for rapid analysis of large corpora, we have developed methods for efficient access to such databases on parallel computers. These methods combine Bloom filters, in-memory caches, and an HBase cluster to greatly reduce communication costs relative to simpler approaches that either fully distribute or fully replicate the database. Our design can achieve considerable improvements in run time, latency, and storage space relative to other methods. In one segmentation application, we improve performance by a factor of 3 relative to an HBase-based implementation.
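
The abstract names three components (Bloom filters, in-memory caches, and a remote HBase cluster) but does not spell out how a single look-up passes through them. The following is a minimal sketch of one plausible read path, not the authors' implementation. It assumes the Bloom filter screens out keys that are absent from the database, a small LRU cache serves repeated keys, and only the remaining requests reach the remote store. All class and method names (`BloomFilter`, `LruCache`, `RemoteStore`, `Lookup`) are hypothetical, and `RemoteStore` is a plain dictionary standing in for an HBase table rather than a real HBase client. The update path that the abstract mentions is not modelled.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class LruCache:
    """Bounded in-memory cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry


class RemoteStore:
    """Assumption: a dict standing in for the remote table (not an HBase API)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0  # counts simulated network round trips

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class Lookup:
    """Read path: Bloom filter, then local cache, then remote store."""

    def __init__(self, store, cache_capacity=2):
        self.store = store
        self.cache = LruCache(cache_capacity)
        self.known_keys = BloomFilter()
        for key in store.rows:
            self.known_keys.add(key)

    def get(self, key, default=0.0):
        if key not in self.known_keys:
            return default  # definitely absent: no round trip needed
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        value = self.store.get(key)
        if value is None:
            return default  # Bloom filter false positive
        self.cache.put(key, value)
        return value


if __name__ == "__main__":
    store = RemoteStore({"new york": 0.8, "york city": 0.5, "city hall": 0.3})
    lookup = Lookup(store)
    for phrase in ["new york", "new york", "york city", "blue whale"]:
        print(phrase, lookup.get(phrase))
    print("remote reads:", store.reads)
```

In this sketch the filter is built once from the full key set, so it suits a database whose keys are fixed while values change. A design in which iterations add new keys would also need to insert them into the filter on every worker.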