Domain adaptation in statistical machine translation using comparable corpora: case study for english latvian IT localisation

Authors:
Mārcis Pinnis;Inguna Skadiņa;Andrejs Vasiļjevs
Affiliations:
Tilde, Latvia;Tilde, Latvia;Tilde, Latvia
Venue:
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
Year:
2013

Citing 11
Cited 0

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics
Methodologies for Measuring the Correlations between Post-Editing Effort and Machine Translatability

Machine Translation
Extracting parallel sub-sentential fragments from non-parallel corpora

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Moses: open source toolkit for statistical machine translation

ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
On the use of comparable corpora to improve SMT performance

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Statistical Machine Translation

Statistical Machine Translation
Improving SMT for Baltic Languages with Factored Models

Proceedings of the 2010 conference on Human Language Technologies -- The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010
Parallel sentence generation from comparable corpora for improved SMT

Machine Translation
Measuring comparability of documents in non-parallel corpora for efficient extraction of (semi-)parallel translation equivalents

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
LetsMT!: a cloud-based platform for do-it-yourself machine translation

ACL '12 Proceedings of the ACL 2012 System Demonstrations
ACCURAT toolkit for multi-level alignment and information extraction from comparable corpora

ACL '12 Proceedings of the ACL 2012 System Demonstrations

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the recent years, statistical machine translation (SMT) has received much attention from language technology researchers and it is more and more applied not only to widely used language pairs, but also to under-resourced languages. However, under-resourced languages and narrow domains face the problem of insufficient parallel data for building SMT systems of reasonable quality for practical applications. In this paper we show how broad domain SMT systems can be successfully tailored to narrow domains using data extracted from strongly comparable corpora. We describe our experiments on adaptation of a baseline English-Latvian SMT system trained on publicly available parallel data (mostly legal texts) to the information technology domain by adding data extracted from in-domain comparable corpora. In addition to comparative human evaluation the adapted SMT system was also evaluated in a real life localisation scenario. Application of comparable corpora provides significant improvements increasing human translation productivity by 13.6% while maintaining an acceptable quality of translation.