ACCURAT toolkit for multi-level alignment and information extraction from comparable corpora

  • Authors:
  • Mārcis Pinnis; Radu Ion; Dan Ştefănescu; Fangzhong Su; Inguna Skadiņa; Andrejs Vasiļjevs; Bogdan Babych

  • Affiliations:
  • Tilde, Riga, Latvia; Research Institute for Artificial Intelligence, Romanian Academy; Research Institute for Artificial Intelligence, Romanian Academy; University of Leeds; Tilde, Riga, Latvia; Tilde, Riga, Latvia; University of Leeds

  • Venue:
  • ACL '12 Proceedings of the ACL 2012 System Demonstrations
  • Year:
  • 2012

Abstract

The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles to the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bilingual or multilingual text resources), which are much more widely available than parallel translation data. The toolkit presented here extracts parallel content from comparable corpora. It consists of tools bundled into two workflows: (1) alignment of comparable documents and extraction of parallel sentences, and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences, along with bilingual dictionaries of terms and named entities, from comparable corpora. This demonstration focuses on English, Latvian, Lithuanian, and Romanian.
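
The document-pairing step of the first workflow can be illustrated with a minimal sketch. This is not the ACCURAT implementation: the abstract does not describe the toolkit's similarity measures, so the sketch assumes a toy approach in which source-language tokens are mapped through a small seed lexicon and compared with target documents by Jaccard overlap. The function names, the `threshold` parameter, and the lexicon are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def translate_tokens(text: str, lexicon: Dict[str, str]) -> Set[str]:
    """Map source-language tokens to target-language tokens.

    Assumption: `lexicon` is a tiny hand-made seed dictionary; tokens
    that are missing from it are simply dropped.
    """
    return {lexicon[tok] for tok in text.lower().split() if tok in lexicon}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_documents(
    source_docs: Dict[str, str],
    target_docs: Dict[str, str],
    lexicon: Dict[str, str],
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Pair each source document with its most similar target document.

    A pair is kept only if its similarity reaches `threshold`.
    """
    pairs: List[Tuple[str, str, float]] = []
    for s_id, s_text in source_docs.items():
        s_tokens = translate_tokens(s_text, lexicon)
        best_id, best_score = None, 0.0
        for t_id, t_text in target_docs.items():
            score = jaccard(s_tokens, set(t_text.lower().split()))
            if score > best_score:
                best_id, best_score = t_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((s_id, best_id, best_score))
    return pairs


if __name__ == "__main__":
    # Toy Latvian-English data, invented for illustration only.
    lexicon = {"suns": "dog", "koks": "tree", "upe": "river"}
    latvian = {"lv1": "suns koks", "lv2": "upe"}
    english = {"en1": "dog tree", "en2": "river bank"}
    print(pair_documents(latvian, english, lexicon))
```

A real system would replace the seed lexicon and the bag-of-words overlap with stronger cross-lingual similarity features, and would pass the paired documents on to a separate parallel-sentence extraction step.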