Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Authors:
Mathias Creutz;Teemu Hirsimäki;Mikko Kurimo;Antti Puurula;Janne Pylkkönen;Vesa Siivola;Matti Varjokallio;Ebru Arisoy;Murat Saraçlar;Andreas Stolcke
Affiliations:
Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Helsinki University of Technology, TKK, Finland;Boğaziçi University, Istanbul;Boğaziçi University, Istanbul;SRI International, Menlo Park, and International Computer Science Institute, Berkeley
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2007

Abstract

We explore the use of morph-based language models in large-vocabulary continuous speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. Estimating n-gram language models over sequences of morphs instead of words improves the language model through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas morph models can recognize previously unseen word forms by concatenating morphs. We show that the morph models perform fairly well on OOVs without compromising recognition accuracy on in-vocabulary words. The Arabic experiment is the only exception: there, the standard word model outperforms the morph model. Differences in the datasets and in the amount of data are discussed as a plausible explanation.
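
The coverage argument in the abstract can be illustrated with a minimal sketch. It is not the Morfessor algorithm and not the authors' setup: the segmentation table below is hand-made and hypothetical, and the toy word lists are assumptions chosen only to show why a morph vocabulary can cover word forms that a word vocabulary has never seen.

```python
# Hypothetical, hand-written segmentations of Finnish-like word forms.
# These stand in for the output of an unsupervised segmenter; they are
# NOT produced by Morfessor.
SEGMENTATIONS = {
    "talo": ["talo"],
    "talossa": ["talo", "ssa"],
    "talosta": ["talo", "sta"],
    "auto": ["auto"],
    "autossa": ["auto", "ssa"],
    "autosta": ["auto", "sta"],
}


def to_morphs(words):
    """Replace each word by its sequence of morphs."""
    morphs = []
    for word in words:
        morphs.extend(SEGMENTATIONS[word])
    return morphs


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Toy training and test data (assumed for illustration only).
train_words = ["talo", "talossa", "auto", "autosta"]
test_words = ["talosta", "autossa", "talo"]

# Word units: "talosta" and "autossa" were never seen in training.
word_oov = oov_rate(train_words, test_words)

# Morph units: every test morph ("talo", "sta", "auto", "ssa") was seen.
morph_oov = oov_rate(to_morphs(train_words), to_morphs(test_words))

print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

On this toy data the two unseen word forms are built entirely from morphs that occur in training, so an n-gram model over morphs could in principle assign them a probability, while a word model cannot. A real system also needs some way to mark word boundaries so that recognized morph sequences can be joined back into words; that step is omitted here.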