SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

  • Authors:
  • Horia Cucu; Andi Buzo; Laurent Besacier; Corneliu Burileanu

  • Affiliations:
  • University "Politehnica" of Bucharest, Romania and LIG, University Joseph Fourier, Grenoble, France; University "Politehnica" of Bucharest, Romania; LIG, University Joseph Fourier, Grenoble, France; University "Politehnica" of Bucharest, Romania

  • Venue:
  • Speech Communication
  • Year:
  • 2014

Abstract

This study investigates the possibility of using statistical machine translation (SMT) to create domain-specific language resources. We propose a methodology for building a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis of why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.