UIMA GRID: Distributed Large-scale Text Analysis

Authors:
Michael Thomas Egner; Markus Lorch; Edd Biddle
Affiliations:
Albstadt-Sigmaringen University, Germany; IBM Germany Development Lab Boeblingen, Germany; IBM United Kingdom
Venue:
CCGRID '07 Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid
Year:
2007

Citing 0
Cited 4

gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments
Languages and Compilers for Parallel Computing

High-performance high-volume layered corpora annotation
ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop

ParaText: scalable text modeling and analysis
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

Towards robust multi-tool tagging. An OWL/DL-based approach
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Abstract

This paper shows how loosely coupled compute resources managed by Condor can be combined with IBM OmniFind to implement a scalable environment for text analysis based on the Unstructured Information Management Architecture (UIMA). Text analysis can extract valuable knowledge, such as entities and their relationships, from unstructured text data. When applied to large amounts of data, e.g., on the order of several million documents, the process can be too time-consuming to react to business needs. This becomes a particular problem when the rule sets, dictionaries, or taxonomies used by the text analysis components are changed to extract new information for a particular business purpose, because such changes may require reanalyzing the entire set of documents. In the scenario motivating this work, a constantly growing set of currently 10 million documents must frequently be reprocessed to accommodate such changes. The text analysis algorithms deployed are very complex and compute-intensive, currently requiring about 20 CPU-years for a full reanalysis. Through the distributed architecture discussed in this paper, the reanalysis can be performed in one calendar month by opportunistically leveraging compute nodes from a heterogeneous Condor pool.
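
The figures quoted in the abstract imply a rough scale for the pool, which the following back-of-the-envelope sketch works out. It is not taken from the paper: the function names are hypothetical, and the calculation assumes a 365-day year, a calendar month equal to one twelfth of a year, perfectly parallel work, and nodes that are busy for the whole month. An opportunistic pool would likely need more nodes than this idealized average.

```python
# Rough sizing derived only from the numbers in the abstract.
# Assumptions (not from the paper): 365-day year, ideal parallel scaling,
# and every CPU available for the entire wall-clock period.
SECONDS_PER_YEAR = 365 * 24 * 3600


def required_cpus(cpu_years: float, wall_clock_months: float) -> float:
    """Average number of concurrently busy CPUs needed to finish in time."""
    return cpu_years * 12 / wall_clock_months


def cpu_seconds_per_document(cpu_years: float, documents: int) -> float:
    """Average CPU time spent on one document."""
    return cpu_years * SECONDS_PER_YEAR / documents


if __name__ == "__main__":
    cpus = required_cpus(cpu_years=20, wall_clock_months=1)
    per_doc = cpu_seconds_per_document(cpu_years=20, documents=10_000_000)
    print(f"average busy CPUs needed: {cpus:.0f}")        # 240
    print(f"CPU seconds per document: {per_doc:.1f}")     # about 63.1
```

Under these assumptions, a full reanalysis in one month corresponds to roughly 240 CPUs kept busy on average, at about one CPU-minute per document.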