High-performance high-volume layered corpora annotation

Authors:
Tiago Luís;David Martins de Matos
Affiliations:
L²F - INESC-ID, Lisboa, Portugal;L²F - INESC-ID, Lisboa, Portugal
Venue:
ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop
Year:
2009

Abstract

NLP systems that deal with large collections of text require significant computational resources, in terms of both space and processing time. Moreover, these systems typically add new layers of linguistic information, each of which references another layer. Spreading these layered annotations across different files makes the data more difficult to process and access. As the amount of input increases, so does the difficulty of processing it. One approach is to use distributed parallel computing to solve these larger problems and save time. We propose a framework that simplifies the integration of independently existing NLP tools to build language-independent NLP systems capable of creating layered annotations. It also allows the development of scalable NLP systems that execute NLP tools in parallel, while offering an easy-to-use programming environment and transparent handling of distributed computing problems. With this framework, execution time on a cluster with 80 cores was reduced by a factor of 40 relative to the original.