SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

Authors:
Horia Cucu; Andi Buzo; Laurent Besacier; Corneliu Burileanu
Affiliations:
University "Politehnica" of Bucharest, Romania and LIG, University Joseph Fourier, Grenoble, France; University "Politehnica" of Bucharest, Romania; LIG, University Joseph Fourier, Grenoble, France; University "Politehnica" of Bucharest, Romania
Venue:
Speech Communication
Year:
2014

Abstract

This study investigates the use of statistical machine translation to create domain-specific language resources. We propose a methodology for building a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. The paper closes with an in-depth analysis explaining why and how the machine-translated text improves the performance of the domain-specific ASR system. As by-products of this core domain-adaptation methodology, this paper also presents the first large-vocabulary continuous speech recognition system for Romanian and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.