High-performance high-volume layered corpora annotation

  • Authors:
  • Tiago Luís; David Martins de Matos

  • Affiliations:
  • L2F - INESC-ID, Lisboa, Portugal; L2F - INESC-ID, Lisboa, Portugal

  • Venue:
  • ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop
  • Year:
  • 2009

Abstract

NLP systems that deal with large collections of text require significant computational resources, in terms of both space and processing time. Moreover, these systems typically add new layers of linguistic information that reference other layers. Spreading these layered annotations across different files makes the data more difficult to process and access. As the amount of input increases, so does the difficulty of processing it. One approach is to use distributed parallel computing to solve these larger problems and save time. We propose a framework that simplifies the integration of independently existing NLP tools to build language-independent NLP systems capable of creating layered annotations. Moreover, it allows the development of scalable NLP systems that execute NLP tools in parallel, while offering an easy-to-use programming environment and transparent handling of distributed computing problems. With this framework, execution time was reduced by a factor of 40 on a cluster with 80 cores.
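
The two ideas in the abstract, annotation layers that reference one another and tools applied to many documents in parallel, can be illustrated with a minimal Python sketch. It is not the authors' framework: the abstract does not describe its API, so every name here (`Annotation`, `tokenize`, `tag_capitalized`, `annotate_corpus`, `parallel_efficiency`) is hypothetical, the "tools" are toy stand-ins, and a local thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One stand-off annotation (hypothetical structure, not from the paper)."""
    layer: str                 # name of the layer this annotation belongs to
    start: int                 # character offset where the span begins
    end: int                   # character offset where the span ends (exclusive)
    label: str                 # annotation value
    ref: Optional[int] = None  # index of the annotation it refers to in a lower layer


def tokenize(text):
    """Toy base layer: whitespace tokens with character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Annotation("token", start, end, word))
        pos = end
    return tokens


def tag_capitalized(tokens):
    """Toy second layer: each tag points back to a token by index."""
    return [
        Annotation("tag", t.start, t.end,
                   "CAP" if t.label[:1].isupper() else "LOW", ref=i)
        for i, t in enumerate(tokens)
    ]


def annotate(text):
    """Run the tools in sequence and keep all layers together."""
    tokens = tokenize(text)
    return {"token": tokens, "tag": tag_capitalized(tokens)}


def annotate_corpus(docs, workers=4):
    """Annotate documents independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, docs))


def parallel_efficiency(speedup, cores):
    """Speedup divided by the number of cores used."""
    return speedup / cores


if __name__ == "__main__":
    layers = annotate_corpus(["Lisboa is sunny", "hello World"])
    for doc in layers:
        for tag in doc["tag"]:
            token = doc["token"][tag.ref]
            print(token.label, tag.label, (tag.start, tag.end))
    print(parallel_efficiency(40, 80))
```

Keeping both layers in a single structure means a tag can be resolved to its token without opening a second file, which is the access problem the abstract raises. Under the usual definition of parallel efficiency, the reported 40-fold reduction on 80 cores corresponds to an efficiency of 0.5.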