Unlimited vocabulary speech recognition for agglutinative languages

Authors:
Mikko Kurimo;Antti Puurula;Ebru Arisoy;Vesa Siivola;Teemu Hirsimäki;Janne Pylkkönen;Tanel Alumäe;Murat Saraclar
Affiliations:
Helsinki University of Technology, HUT, Finland;Helsinki University of Technology, HUT, Finland;Bogazici University, Bebek, Istanbul, Turkey;Helsinki University of Technology, HUT, Finland;Helsinki University of Technology, HUT, Finland;Helsinki University of Technology, HUT, Finland;Tallinn Technical University, Estonia;Bogazici University, Bebek, Istanbul, Turkey
Venue:
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Year:
2006

Abstract

It is practically impossible to build a word-based lexicon for speech recognition in agglutinative languages that covers all the relevant words. The problem is that words are generally built by concatenating several prefixes and suffixes to the word roots. Together with compounding and inflection, this leads to millions of different but still frequent word forms. Because of inflection, ambiguity, and other phenomena, it is also not trivial to split the words automatically into meaningful parts. Rule-based morphological analyzers can perform this splitting, but because they rely on handcrafted rules, they too suffer from an out-of-vocabulary problem. In this paper we apply a recently proposed, fully automatic, and largely language- and vocabulary-independent method for building sub-word lexica to three different agglutinative languages. We also demonstrate language portability by building a successful large-vocabulary speech recognizer for each language, and we show superior recognition performance compared to the corresponding word-based reference systems.
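
The coverage argument in the abstract can be illustrated with a small sketch. It is not the method applied in the paper, which learns its sub-word units automatically from data; here the sub-word inventory `morph_lexicon`, the word list `word_lexicon`, and the helper `segment` are all hypothetical and hand-picked for illustration, using a few Finnish forms of "talo" (house) and "auto" (car). The sketch only shows why a lexicon of sub-word units can cover inflected forms that a word-based lexicon has never seen.

```python
def segment(word, morphs):
    """Return a segmentation of `word` into units from `morphs`, or None.

    Simple dynamic programme: best[i] holds a segmentation of word[:i]
    with the fewest units, or None if word[:i] cannot be covered.
    """
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in morphs:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]


# Hypothetical toy lexica (assumptions, not taken from the paper).
word_lexicon = {"talo", "talossa", "auto", "autossa"}
morph_lexicon = {"talo", "auto", "ssa", "ni", "kin"}

test_words = ["talossa", "autossani", "talossanikin"]

# A word is out-of-vocabulary for the word lexicon if it is not listed,
# and for the sub-word lexicon if no concatenation of units produces it.
word_oov = [w for w in test_words if w not in word_lexicon]
morph_oov = [w for w in test_words if segment(w, morph_lexicon) is None]

for w in test_words:
    print(w, "->", segment(w, morph_lexicon))
print("OOV with word lexicon:", word_oov)
print("OOV with sub-word lexicon:", morph_oov)
```

With five sub-word units, every test form can be assembled by concatenation, while the four-entry word lexicon misses each form it has not listed explicitly. A real system would also need a language model over the sub-word units and a way to restore word boundaries in the recognizer output, neither of which this sketch attempts.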