Language-independent compound splitting with morphological operations

  • Authors:
  • Klaus Macherey; Andrew M. Dai; David Talbot; Ashok C. Popat; Franz Och

  • Affiliations:
  • Google Inc., Mountain View, CA; University of Edinburgh, Edinburgh, UK; Google Inc., Mountain View, CA; Google Inc., Mountain View, CA; Google Inc., Mountain View, CA

  • Venue:
  • HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
  • Year:
  • 2011

Abstract

Translating compounds is an important problem in machine translation. Because many compounds are never observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages because they cannot handle more complex compound-forming processes. We present a novel unsupervised method that learns both the compound parts and the morphological operations needed to split compounds into those parts. The morphological operations are learned from a bilingual corpus, while monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate the method within a machine translation task and show significant improvements for several languages, demonstrating the versatility of the approach.
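
To make the idea of splitting with morphological operations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small hand-written set of compound parts and operations for German, whereas the abstract describes learning both from corpora. The names `PARTS`, `OPERATIONS`, and `split_compound` are hypothetical, and the choice of the split with the fewest parts is an assumption made only for illustration.

```python
from functools import lru_cache

# Hypothetical toy data; the paper learns these from monolingual and
# bilingual corpora rather than listing them by hand.
PARTS = {"verkehr", "zeichen", "kirche", "turm"}

# Each operation is (suffix_in_compound, replacement_in_base_form) and is
# applied to non-final parts only.
OPERATIONS = [
    ("", ""),    # identity: the part appears unchanged
    ("s", ""),   # remove a linking "s" (verkehrs -> verkehr)
    ("n", ""),   # remove a linking "n"
    ("", "e"),   # restore a dropped final "e" (kirch -> kirche)
]


def split_compound(word, parts=PARTS, operations=OPERATIONS):
    """Return a split with the fewest parts, or [word] if none is found."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        candidates = []
        # The final part must match a known part without any operation.
        if word[start:] in parts:
            candidates.append((word[start:],))
        # Otherwise try every prefix as a non-final part, undoing each
        # morphological operation to recover its base form.
        for end in range(start + 1, len(word)):
            surface = word[start:end]
            for suffix, replacement in operations:
                if not surface.endswith(suffix):
                    continue
                base = surface[: len(surface) - len(suffix)] + replacement
                if base in parts:
                    rest = best(end)
                    if rest:
                        candidates.append((base,) + rest)
        return min(candidates, key=len) if candidates else ()

    result = best(0)
    return list(result) if result else [word]


if __name__ == "__main__":
    for w in ["Verkehrszeichen", "Kirchturm", "Haus"]:
        print(w, "->", split_compound(w))
```

In this sketch a word that cannot be covered by known parts is returned unsplit, which mirrors the usual fallback of leaving unknown words to the translation system.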