Automatic diacritization of Arabic for acoustic modeling in speech recognition

Authors:
Dimitra Vergyri; Katrin Kirchhoff
Affiliations:
SRI International, Menlo Park, CA; University of Washington, Seattle, WA
Venue:
Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
Year:
2004

Abstract

Automatic recognition of Arabic dialectal speech is a challenging task because Arabic dialects are essentially spoken varieties. Only a few dialectal resources are available to date; moreover, most available acoustic data collections are transcribed without diacritics. Such transcriptions omit essential pronunciation information about a word, such as short vowels. In this paper, we investigate various procedures that enable us to use such training data by automatically inserting the missing diacritics into the transcription. These procedures use acoustic information in combination with different levels of morphological and contextual constraints. We evaluate their performance against manually diacritized transcriptions. In addition, we demonstrate the effect of their accuracy on the recognition performance of acoustic models trained on automatically diacritized training data.