Building bilingual microcomputer systems
Communications of the ACM
A maximum entropy approach to natural language processing
Computational Linguistics
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Maximum Entropy Markov Models for Information Extraction and Segmentation
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Text chunking based on a generalization of winnow
The Journal of Machine Learning Research
Sequential conditional Generalized Iterative Scaling
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Language model based arabic word segmentation
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
An HMM approach to vowel restoration in Arabic and Hebrew
SEMITIC '02 Proceedings of the ACL-02 workshop on Computational approaches to semitic languages
The impact of morphological stemming on Arabic mention detection and coreference resolution
Semitic '05 Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Arabic diacritization using weighted finite-state transducers
Semitic '05 Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Automatic diacritization of Arabic for acoustic modeling in speech recognition
Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
Constrained Sequence Classification for Lexical Disambiguation
PRICAI '08 Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
Arabic diacritic restoration approach based on maximum entropy models
Computer Speech and Language
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Arabic diacritization through full morphological tagging
NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
A hybrid approach for building Arabic diacritizer
Semitic '09 Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages
Automatic diacritization for low-resource languages using a hybrid word and consonant CMM
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Improving Arabic dependency parsing with lexical and inflectional morphological features
SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Decision trees for lexical smoothing in statistical machine translation
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Better Arabic parsing: baselines, evaluations, and analysis
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
ACM Transactions on Asian Language Information Processing (TALIP)
Artificial Intelligence Review
Part of speech tagging for arabic
Natural Language Engineering
Supervised collaboration for syntactic annotation of Quranic Arabic
Language Resources and Evaluation
Dependency parsing of modern standard arabic with lexical and inflectional features
Computational Linguistics
Hi-index | 0.00 |
Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Script without diacritics have considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. We propose in this paper a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate 8.5%, and a word error rate of 17.3%. In case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate 4.0%, and a word error rate of 7.2%.