An HMM approach to vowel restoration in Arabic and Hebrew

  • Authors: Ya'akov Gal
  • Affiliations: Harvard University, Cambridge, MA
  • Venue: SEMITIC '02: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages
  • Year: 2002

Abstract

Semitic languages pose a problem for Natural Language Processing because most vowels are omitted from written prose, resulting in considerable ambiguity at the word level. While reading text, however, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Previous methods for vowel restoration that involved morphological analysis concentrated on a single language and relied on a parsed corpus, which is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using publicly available versions of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
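The abstract does not spell out the model, so the following is only a minimal sketch of how an HMM could be applied to this task, not the author's implementation. It assumes a word-level bigram HMM in which each hidden state is a vocalized word, each observation is the same word with its vowels removed (so emissions are deterministic), transitions are add-one smoothed, and decoding uses the Viterbi algorithm. The toy corpus uses a Latin transliteration in which the letters a, e, i, o, u stand in for vowel diacritics; the corpus and all function names (`strip_vowels`, `train`, `restore`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

VOWELS = set("aeiou")  # assumption: toy stand-in for vowel diacritics
START = "<s>"          # sentence-start pseudo-state


def strip_vowels(word):
    """Map a vocalized word to its unvocalized (observed) form."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(sentences):
    """Collect bigram counts and the vocalized candidates for each skeleton."""
    context = Counter()            # times a word is used as a left context
    bigrams = Counter()            # (previous word, word) counts
    candidates = defaultdict(set)  # unvocalized form -> vocalized forms
    vocab = set()
    for sentence in sentences:
        prev = START
        for word in sentence:
            context[prev] += 1
            bigrams[(prev, word)] += 1
            candidates[strip_vowels(word)].add(word)
            vocab.add(word)
            prev = word
    return context, bigrams, candidates, len(vocab)


def log_transition(prev, word, model):
    """Add-one smoothed log P(word | prev)."""
    context, bigrams, _, vocab_size = model
    return math.log((bigrams[(prev, word)] + 1) / (context[prev] + vocab_size))


def restore(observed, model):
    """Viterbi decoding: most probable vocalized sequence for the skeletons."""
    candidates = model[2]
    best = {START: (0.0, [])}  # state -> (log probability, best path so far)
    for obs in observed:
        # Unknown skeletons are passed through unchanged.
        states = sorted(candidates.get(obs, {obs}))
        new_best = {}
        for state in states:
            score, path = max(
                ((lp + log_transition(prev, state, model), p)
                 for prev, (lp, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training data: "ktb" is ambiguous between "katab" and "kutub".
corpus = [
    ["huwa", "katab", "risala"],
    ["hadhihi", "kutub", "jadida"],
]
model = train(corpus)
print(restore(["hw", "ktb", "rsl"], model))    # ['huwa', 'katab', 'risala']
print(restore(["hdhh", "ktb", "jdd"], model))  # ['hadhihi', 'kutub', 'jadida']
```

In this sketch the ambiguous skeleton "ktb" is resolved purely from the neighboring words, which mirrors the abstract's point that context drives vocalization. Because emissions are deterministic, the only learned parameters are the transition counts, and no language-specific rules are encoded.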