Domain adaptation in statistical machine translation using comparable corpora: case study for English-Latvian IT localisation

  • Authors:
  • Mārcis Pinnis; Inguna Skadiņa; Andrejs Vasiļjevs

  • Affiliations:
  • Tilde, Latvia; Tilde, Latvia; Tilde, Latvia

  • Venue:
  • CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
  • Year:
  • 2013

Abstract

In recent years, statistical machine translation (SMT) has received much attention from language technology researchers, and it is increasingly applied not only to widely used language pairs but also to under-resourced languages. However, under-resourced languages and narrow domains face the problem of insufficient parallel data for building SMT systems of reasonable quality for practical applications. In this paper we show how broad-domain SMT systems can be successfully tailored to narrow domains using data extracted from strongly comparable corpora. We describe our experiments on adapting a baseline English-Latvian SMT system, trained on publicly available parallel data (mostly legal texts), to the information technology domain by adding data extracted from in-domain comparable corpora. In addition to comparative human evaluation, the adapted SMT system was also evaluated in a real-life localisation scenario. Using data from comparable corpora provides significant improvements, increasing human translation productivity by 13.6% while maintaining an acceptable quality of translation.