Joining linguistic and statistical methods for Spanish-to-Basque speech translation

  • Authors:
  • Alicia Pérez;M. Inés Torres;Francisco Casacuberta

  • Affiliations:
  • Department of Electricity and Electronics, Faculty of Science and Technology, University of the Basque Country, 48940 Leioa, Spain;Department of Electricity and Electronics, Faculty of Science and Technology, University of the Basque Country, 48940 Leioa, Spain;Department of Information Systems and Computation, Faculty of Computer Science, Technical University of Valencia, Camí de Vera, s/n, 46071 Valencia, Spain

  • Venue:
  • Speech Communication
  • Year:
  • 2008

Abstract

The goal of this work is to develop a text and speech translation system from Spanish to Basque. The two languages differ markedly in both morphology and syntax, which poses attractive challenges for machine translation. Moreover, since both languages share official status in the Basque Country, the underlying motivation is not only academic but also practical. Finite-state transducers were adopted as the basic translation models. The main contribution of this work is the study of several techniques to improve probabilistic finite-state transducers by means of additional linguistic knowledge. Two methods that combine linguistics and statistics were proposed. The first performed a morphological analysis in an attempt to benefit from atomic meaningful units when rendering the meaning from one language into the other. The second clustered words according to their syntactic role and used the resulting phrases as translation units. From the latter approach, phrase-based finite-state transducers arose as a natural extension of classical ones. The models were assessed on a restricted-domain task that is very repetitive and has a small vocabulary. Experimental results showed that both the morphological and the syntactic approaches outperformed the baseline under different test sets and architectures for speech translation.
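As a rough illustration of what a probabilistic finite-state transducer does, the sketch below reads a source sentence word by word and returns the most likely output along an accepting path. It is a toy under stated assumptions, not the authors' models: the function `translate`, the tables `TOY_EDGES` and `TOY_FINALS`, the states, the probabilities, and the two-phrase Spanish-Basque example are all hypothetical and hand-written, whereas the abstract's transducers are improved with linguistic knowledge and evaluated on a restricted-domain task.

```python
from typing import Dict, List, Optional, Tuple

# (state, source word) -> list of (next state, emitted target words, probability)
Edges = Dict[Tuple[int, str], List[Tuple[int, Tuple[str, ...], float]]]

# Hypothetical toy transducer: "la casa" -> "etxea", "la casa grande" -> "etxe handia".
TOY_EDGES: Edges = {
    (0, "la"): [(1, (), 1.0)],                       # the article emits nothing yet
    (1, "casa"): [(2, ("etxea",), 0.7), (3, ("etxe",), 0.3)],
    (3, "grande"): [(2, ("handia",), 1.0)],
}
TOY_FINALS: Dict[int, float] = {2: 1.0}              # final state -> final probability


def translate(
    edges: Edges,
    finals: Dict[int, float],
    source: List[str],
    start: int = 0,
) -> Optional[Tuple[float, Tuple[str, ...]]]:
    """Return (probability, target words) of the best accepting path, or None."""
    # best[state] = (probability, output so far) after reading a prefix of the source
    best: Dict[int, Tuple[float, Tuple[str, ...]]] = {start: (1.0, ())}
    for word in source:
        reached: Dict[int, Tuple[float, Tuple[str, ...]]] = {}
        for state, (prob, output) in best.items():
            for target, emitted, edge_prob in edges.get((state, word), []):
                candidate = (prob * edge_prob, output + emitted)
                # Viterbi step: keep only the most probable way to reach each state
                if target not in reached or candidate[0] > reached[target][0]:
                    reached[target] = candidate
        best = reached
    complete = [(p * finals[s], out) for s, (p, out) in best.items() if s in finals]
    if not complete:
        return None  # no accepting path for this input
    return max(complete, key=lambda item: item[0])
```

In this toy, the output for "casa" depends on whether an adjective follows, which is why the transducer keeps two competing paths alive until the end of the input. A phrase-based variant in the spirit of the second approach would label edges with multi-word units rather than single words.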