Arabic diacritic restoration approach based on maximum entropy models

Authors:
Imed Zitouni;Ruhi Sarikaya
Affiliations:
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, United States;IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, United States
Venue:
Computer Speech and Language
Year:
2009

Abstract

In Modern Standard Arabic and in dialectal Arabic texts, short vowels and other diacritics are usually omitted. Exceptions are made for important political and religious texts and for scripts written for beginning students of Arabic. Scripts without diacritics have considerable ambiguity, because many words with different diacritic patterns appear identical once the diacritics are removed. In this paper we present a maximum entropy approach for restoring short vowels and other diacritics in an Arabic document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based, and part-of-speech tag features. The combination of these feature types leads to a high-performance diacritic restoration model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In the case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%. We also compare our approach to previously published techniques and demonstrate its effectiveness in restoring diacritics in different kinds of data, such as dialectal Iraqi Arabic scripts.
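
To make the general shape of a maximum entropy classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes diacritic restoration is framed as choosing one label per character, and it scores labels with the standard log-linear form p(y | x) = exp(sum of weights of active features) / Z(x). The label set, the Latin-letter stand-ins for Arabic characters and diacritics, the feature names, and the hand-set weights are all hypothetical. Only toy lexical context features are shown; segment-based and part-of-speech features would be added as further indicator strings in the same way.

```python
import math

# Hypothetical label set: Latin stand-ins for short vowels, "" for no diacritic.
LABELS = ["a", "i", "u", "o", ""]


def lexical_features(chars, i):
    """Return indicator features for the character at position i.

    Only the current, previous, and next characters are used here
    (an illustrative choice, not the feature set of the paper).
    """
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"cur={chars[i]}", f"prev={prev_c}", f"next={next_c}"]


def maxent_probs(features, weights, labels=LABELS):
    """Compute p(y | x) = exp(sum_i w[(f_i, y)]) / Z(x) for every label y."""
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in features) for y in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())  # normalizer Z(x)
    return {y: e / z for y, e in exps.items()}


# Hand-set, purely illustrative weights; a trained model would estimate these
# from diacritized text.
WEIGHTS = {
    ("cur=k", "a"): 1.0,
    ("next=t", "a"): 0.5,
    ("cur=k", "i"): 0.8,
    ("next=</s>", ""): 2.0,
}

chars = list("ktb")  # toy undiacritized word in Latin stand-in characters
probs_first = maxent_probs(lexical_features(chars, 0), WEIGHTS)
probs_last = maxent_probs(lexical_features(chars, 2), WEIGHTS)
best_first = max(probs_first, key=probs_first.get)
best_last = max(probs_last, key=probs_last.get)
```

Because every information source enters the model as just another indicator feature with its own weight, adding a new feature type does not change the scoring function, which is one reading of why the abstract describes the approach as easy to extend with diverse information.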