Maximum entropy based restoration of Arabic diacritics

Authors:
Imed Zitouni;Jeffrey S. Sorensen;Ruhi Sarikaya
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY;IBM T.J. Watson Research Center, Yorktown Heights, NY;IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Year:
2006

Abstract

Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Scripts without diacritics have considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. We propose in this paper a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In a case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%.
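
As an illustration of the general idea only, the following minimal sketch trains a maximum entropy (multinomial logistic) classifier that assigns one diacritic label to each consonant from binary context features. It is not the authors' system: the toy data, the Latin transliteration, the label set, the three character-context features, and all function names are assumptions made for this example, and the sketch classifies each character independently, whereas the abstract describes a model that also uses segment-based and part-of-speech tag features.

```python
import math
from collections import defaultdict

# Toy training data (hypothetical): transliterated consonant strings paired
# with one diacritic label per consonant. "a"/"i" are short vowels, "o" is
# sukun (no vowel), "-" means no diacritic.
TOY_DATA = [
    ("ktb", ["a", "a", "a"]),  # kataba
    ("klb", ["a", "o", "-"]),  # kalb
    ("bnt", ["i", "o", "-"]),  # bint
]

def features(chars, i):
    """Binary context features for position i (illustrative feature set)."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"cur={chars[i]}", f"prev={prev_c}", f"next={next_c}"]

def predict_proba(weights, labels, feats):
    """Maximum entropy form: p(y | x) = exp(sum_f w[f, y]) / Z(x)."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}

def train(data, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({y for _, tags in data for y in tags})
    examples = [(features(word, i), tags[i])
                for word, tags in data for i in range(len(word))]
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Gradient: observed feature count minus expected count.
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    weights[(f, y)] += lr * grad
    return weights, labels

def diacritize(weights, labels, word):
    """Pick the most probable diacritic label for each consonant."""
    result = []
    for i in range(len(word)):
        probs = predict_proba(weights, labels, features(word, i))
        result.append(max(probs, key=probs.get))
    return result

weights, labels = train(TOY_DATA)
print(diacritize(weights, labels, "ktb"))
```

Because every feature is just a named indicator, further information sources (for example a segment identity or a part-of-speech tag, if available) could be added as extra strings in the list returned by `features` without changing the training code, which is the property of maximum entropy models that the abstract highlights.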