This paper shows how loosely coupled compute resources managed by Condor can be leveraged together with IBM OmniFind to implement a scalable environment for text analysis based on the Unstructured Information Management Architecture (UIMA). Text analysis extracts valuable knowledge, such as entities and their relationships, from unstructured text data. When applied to large amounts of data, e.g., on the order of several million documents, the process can be too time-consuming to react to business needs. This becomes a particular problem when the rule sets, dictionaries, or taxonomies used by the text analysis components are changed to extract new information for a particular business purpose, since such changes may require that the entire document set be reanalyzed. In the scenario motivating this work, a constantly growing collection of currently 10 million documents must frequently be re-processed to accommodate such changes. The text analysis algorithms deployed are complex and compute-intensive, currently requiring about 20 CPU-years for a full re-analysis. With the distributed architecture discussed in this paper, the re-analysis can be performed in one calendar month by opportunistically leveraging compute nodes from a heterogeneous Condor pool.
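As a rough consistency check on the figures quoted above (this sketch is not from the paper itself; the arithmetic and variable names are our own), 20 CPU-years of work compressed into one calendar month implies the Condor pool must keep roughly 240 cores busy on average, and the abstract's numbers imply about a minute of CPU time per document:

```python
# Back-of-the-envelope sizing using the figures from the abstract:
# ~20 CPU-years of total work, 10 million documents, and a target
# wall-clock time of one calendar month.
cpu_years_of_work = 20
target_months = 1

# CPU-months of work divided by the wall-clock budget gives the number
# of concurrently busy cores the pool must sustain on average.
cpu_months = cpu_years_of_work * 12
required_cores = cpu_months / target_months
print(required_cores)  # 240.0

# Average per-document CPU cost implied by the same figures.
documents = 10_000_000
seconds_per_doc = cpu_years_of_work * 365 * 24 * 3600 / documents
print(round(seconds_per_doc))  # 63 seconds of CPU time per document
```

Since an opportunistic Condor pool only contributes idle cycles, the pool itself would need to be noticeably larger than 240 nodes to sustain that average throughput.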