Unsupervised search for the optimal segmentation for statistical machine translation

  • Authors: Coşkun Mermer

  • Affiliations: Boğaziçi University, Bebek, Istanbul, Turkey and TÜBİTAK-UEKAE, Gebze, Kocaeli, Turkey

  • Venue: ACLstudent '10: Proceedings of the ACL 2010 Student Research Workshop

  • Year: 2010

Abstract

We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Embedding the proposed segmentation method in an SMT task from morphologically rich Turkish to English does not yield the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.
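
To illustrate the incremental computation mentioned in the abstract, here is a minimal Python sketch of the standard IBM Model-1 sentence likelihood, with per-word sums cached so that re-segmenting one token only adjusts the affected terms. It is not the paper's implementation. The function names, the `<NULL>` token string, the probability floor, the toy translation table, and the choice of conditioning on the segmented side are all assumptions made for the example.

```python
import math

NULL = "<NULL>"   # hypothetical name for the empty (null) word
FLOOR = 1e-12     # assumed floor to avoid log(0) for unseen word pairs


def word_sums(src_tokens, tgt_tokens, t_table):
    """For each target word f, cache sum over source words e of t(f | e)."""
    src = [NULL] + list(src_tokens)
    return [sum(t_table.get((f, e), 0.0) for e in src) for f in tgt_tokens]


def log_likelihood(sums, src_len, epsilon=1.0):
    """IBM Model-1 log P(target | source) computed from the cached sums."""
    log_p = math.log(epsilon) - len(sums) * math.log(src_len + 1)
    return log_p + sum(math.log(max(s, FLOOR)) for s in sums)


def resegment(sums, tgt_tokens, t_table, old_token, new_tokens):
    """Update the cached sums after one source token is split into new_tokens.

    Only the contributions of the changed token are touched; the rest of
    each sum is reused from the previous segmentation state.
    """
    updated = []
    for f, s in zip(tgt_tokens, sums):
        s -= t_table.get((f, old_token), 0.0)
        s += sum(t_table.get((f, e), 0.0) for e in new_tokens)
        updated.append(s)
    return updated


# Toy translation table t(f | e); the values are made up for illustration.
t_table = {
    ("in", NULL): 0.1, ("house", NULL): 0.1,
    ("in", "evde"): 0.4, ("house", "evde"): 0.5,
    ("in", "de"): 0.6, ("house", "ev"): 0.8,
}
tgt = ["in", "house"]

sums_before = word_sums(["evde"], tgt, t_table)
sums_after = resegment(sums_before, tgt, t_table, "evde", ["ev", "de"])
incremental = log_likelihood(sums_after, src_len=2)
full = log_likelihood(word_sums(["ev", "de"], tgt, t_table), src_len=2)
```

In this sketch `incremental` and `full` agree up to floating-point rounding, while the incremental path looks up only the entries for the changed token rather than rescanning the whole sentence.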