Maximum entropy based restoration of Arabic diacritics
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
In Modern Standard Arabic and in dialectal Arabic texts, short vowels and other diacritics are omitted. Exceptions are made for important political and religious texts and for scripts written for beginning students of Arabic. Scripts without diacritics are considerably ambiguous because many words with different diacritic patterns appear identical once the diacritics are removed. In this paper we present a maximum entropy approach for restoring short vowels and other diacritics in an Arabic document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based, and part-of-speech tag features. The combination of these feature types leads to a high-performance diacritic restoration model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. When case endings are excluded, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%. We also compare our approach to previously published techniques and demonstrate its effectiveness in restoring diacritics in different kinds of data, such as dialectal Iraqi Arabic scripts.
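The abstract reports diacritic and word error rates without defining them. As a rough illustration only, the sketch below assumes a common reading: the diacritic error rate is the fraction of characters whose restored diacritic differs from the reference, and the word error rate is the fraction of words containing at least one such error. The function name `error_rates`, the `(character, diacritic)` pair representation, and the Latin-transliterated toy data are all hypothetical and not taken from the paper; the segment error rate is omitted because it would need a segmentation that the abstract does not describe.

```python
from typing import List, Tuple

# Hypothetical representation: a word is a list of (base_character, diacritic)
# pairs, where an empty string means "no diacritic on this character".
Word = List[Tuple[str, str]]


def error_rates(reference: List[Word], predicted: List[Word]) -> Tuple[float, float]:
    """Return (diacritic_error_rate, word_error_rate) under the assumed definitions."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must contain the same number of words")

    char_total = 0
    char_errors = 0
    word_errors = 0
    for ref_word, hyp_word in zip(reference, predicted):
        if len(ref_word) != len(hyp_word):
            raise ValueError("aligned words must have the same number of characters")
        # Count characters whose predicted diacritic differs from the reference.
        wrong = sum(1 for (_, r), (_, h) in zip(ref_word, hyp_word) if r != h)
        char_total += len(ref_word)
        char_errors += wrong
        # A word counts as an error if any of its diacritics is wrong.
        if wrong > 0:
            word_errors += 1

    if char_total == 0:
        raise ValueError("cannot compute error rates on empty input")
    return char_errors / char_total, word_errors / len(reference)


# Toy data (assumption): two three-character words, one diacritic mistake.
reference = [
    [("k", "a"), ("t", "a"), ("b", "a")],
    [("k", "u"), ("t", "u"), ("b", "")],
]
predicted = [
    [("k", "a"), ("t", "a"), ("b", "a")],
    [("k", "u"), ("t", "i"), ("b", "")],
]
der, wer = error_rates(reference, predicted)
```

On this toy input one of six characters and one of two words are wrong. Under these assumed definitions a single wrong diacritic marks its whole word as an error, so the word error rate can never be lower than the diacritic error rate; the figures in the abstract show the same ordering, though the paper's exact definitions may differ.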