Generation-heavy hybrid machine translation

  • Authors:
  • Nizar Yahya Habash; Bonnie J. Dorr

  • Year:
  • 2003


Abstract

State-of-the-art techniques in Machine Translation (MT) require large amounts of symmetric resources from the source and target languages. This is true regardless of whether the approach is Transfer or Interlingua, Symbolic or Statistical. Symmetry within these approaches is necessary to ensure quality, robustness, and retargetability. In reality, such symmetry, whether in the form of structural transfer lexicons, interlingual dictionaries, or parallel corpora, is a major bottleneck in developing any MT system. This dissertation presents an approach to MT that addresses the lack of symmetry by exploiting symbolic and statistical target-language resources in source-poor/target-rich language pairs. This approach is called Generation-Heavy Hybrid Machine Translation (GHMT). The expected source-language resources are a syntactic parser and a simple one-to-many translation dictionary. No transfer rules or complex interlingual representations are used. Rich target-language symbolic resources are used to overgenerate multiple structural variations from a target-glossed syntactic dependency representation of source-language sentences. Statistical target-language resources are then used to select among the overgenerated translations. The source-target asymmetry of systems developed in this approach makes them more easily retargetable to new source languages. 
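The overgenerate-then-select pipeline described above can be sketched in miniature: expand every one-to-many dictionary gloss into full candidate sentences, then let a statistical target-language model pick the winner. This is a minimal illustration, not the dissertation's implementation; the bigram counts and function names are invented for the example.

```python
import itertools
import math

# Toy bigram counts standing in for a statistical target-language model.
# All data here is illustrative, not taken from the dissertation.
BIGRAM_COUNTS = {
    ("the", "boy"): 5, ("boy", "entered"): 4, ("entered", "the"): 3,
    ("the", "house"): 6, ("boy", "went"): 1, ("went", "into"): 2,
}

def bigram_score(words):
    """Sum of log-counts over adjacent word pairs; unseen pairs get a floor."""
    return sum(math.log(BIGRAM_COUNTS.get(bg, 0.5))
               for bg in zip(words, words[1:]))

def overgenerate(glossed):
    """Expand every one-to-many gloss into full candidate sentences."""
    return [list(cand) for cand in itertools.product(*glossed)]

def translate(glossed):
    """Return the candidate the target-language model scores highest."""
    return max(overgenerate(glossed), key=bigram_score)

# A source sentence glossed word-by-word, with alternatives for one word:
glossed = [["the"], ["boy"], ["entered", "went"], ["the"], ["house"]]
best = translate(glossed)  # the model prefers "entered" over "went" here
```

In GHMT the candidate space also includes structural variations (conflation, head-swapping) rather than only lexical alternatives, but the select-by-statistics step works on the same principle.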
The contributions of this research include: (1) a new model for machine translation that transcends the need for large amounts of symmetric knowledge while maintaining a high degree of robustness, quality, and retargetability; (2) a systematic framework for handling translation divergences that uniformly accommodates a wide range of seemingly different divergence types and their interactions; (3) a hybrid (symbolic-statistical) generation approach that expands the concept of symbolic overgeneration to include conflation and head-swapping of structural variations; (4) the introduction and use of structural n-grams on a large scale in natural language generation; and (5) the creation of several resources that have been used by other researchers, including an extensible MT system for translating into English and a large-scale categorial variation database for English. An extensive evaluation suggests that GHMT is more robust and has superior output quality, in terms of grammaticality and accuracy, relative to a primarily statistical approach.
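Contribution (4) refers to n-grams read off a dependency structure rather than the linear word order. As a hedged sketch of the idea (the tree encoding and names are illustrative, not the dissertation's data structures), structural bigrams can be collected as head-dependent pairs from a nested dependency tree:

```python
# Illustrative sketch: structural bigrams are (head, dependent) pairs
# taken from a dependency tree, not adjacent words in the sentence.
def structural_bigrams(tree):
    """Yield (head, dependent) word pairs from a nested-dict dependency tree."""
    head = tree["word"]
    for child in tree.get("children", []):
        yield (head, child["word"])
        yield from structural_bigrams(child)

# Dependency tree for "the boy entered the house" (hypothetical encoding):
tree = {
    "word": "entered",
    "children": [
        {"word": "boy", "children": [{"word": "the", "children": []}]},
        {"word": "house", "children": [{"word": "the", "children": []}]},
    ],
}
pairs = list(structural_bigrams(tree))
# pairs holds head-dependent pairs such as ("entered", "boy")
```

Counting such pairs over a large parsed corpus yields a structural language model that can rank overgenerated dependency variations directly, before linearization.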