An HMM approach to vowel restoration in Arabic and Hebrew

Authors: Ya'akov Gal
Affiliations: Harvard University, Cambridge, MA
Venue: SEMITIC '02: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages
Year: 2002

Abstract

Semitic languages pose a problem for natural language processing because most vowels are omitted from written prose, resulting in considerable ambiguity at the word level. Native speakers reading such text can nevertheless generally vocalize each word, drawing on their familiarity with the lexicon and on the word's context. Previous methods for vowel restoration that involve morphological analysis concentrated on a single language and relied on a parsed corpus, which is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using publicly available versions of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
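
The abstract does not spell out the model, so the following is only a minimal sketch of how an HMM can be applied to vowel restoration, not the paper's implementation. It assumes a word-bigram model in which the hidden states are vocalized words and each observation is the word's vowel-stripped form. Latin vowels stand in for Hebrew or Arabic diacritics, and the names `strip_vowels`, `train` and `restore`, the add-one smoothing, and the toy corpus are all illustrative assumptions.

```python
import math
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: Latin vowels stand in for diacritics


def strip_vowels(word):
    """Return the consonant skeleton, i.e. the observed (unvocalized) form."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(sentences):
    """Collect unigram and bigram counts over vocalized words, plus the
    set of vocalized candidates seen for each consonant skeleton."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start state
        unigrams[prev] += 1
        for word in sentence:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            candidates[strip_vowels(word)].add(word)
            prev = word
    return unigrams, bigrams, candidates


def log_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log transition probability P(word | prev)."""
    count = bigrams.get((prev, word), 0) + 1
    total = unigrams.get(prev, 0) + vocab_size
    return math.log(count / total)


def restore(skeletons, model):
    """Viterbi decoding: pick the most probable vocalized word sequence.
    Emissions are deterministic, since a vocalized word can only emit
    its own skeleton, so only transition scores are accumulated."""
    unigrams, bigrams, candidates = model
    vocab_size = len(unigrams)
    best = {"<s>": (0.0, [])}  # state -> (log score, best path so far)
    for skeleton in skeletons:
        # An unseen skeleton is passed through unchanged.
        options = sorted(candidates.get(skeleton, {skeleton}))
        new_best = {}
        for word in options:
            score, path = max(
                (
                    (s + log_prob(p, word, unigrams, bigrams, vocab_size), pth)
                    for p, (s, pth) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy corpus: the skeleton "ktb" is ambiguous between "katab" and "kitab".
model = train([["katab", "kitab"], ["qara", "kitab"], ["huwa", "katab"]])
print(restore(["qr", "ktb"], model))
print(restore(["hw", "ktb"], model))
```

In this toy setting the preceding word decides how the ambiguous skeleton is vocalized, which mirrors the abstract's point that readers rely on context as well as the lexicon.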