The effect of automatic tokenization, vocalization, stemming, and POS tagging on Arabic dependency parsing

Authors:
Emad Mohamed
Affiliations:
Suez Canal University, Suez, Egypt
Venue:
CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning
Year:
2011

Abstract

We use an automatic pipeline of word tokenization, stemming, POS tagging, and vocalization to perform real-world Arabic dependency parsing. Although each module is highly accurate, the few errors in tokenization, which reaches an accuracy of 99.34%, lead to a drop of more than 10% in parsing accuracy. This indicates that high-quality dependency parsing of Arabic, and possibly of other morphologically rich languages, cannot be achieved without perfect or near-perfect tokenization. The other pipeline components (stemming, vocalization, and part-of-speech tagging) do not have the same profound effect on dependency parsing.
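
One way to see why a small tokenization error rate can matter so much is to look at it per sentence rather than per token. The sketch below is illustrative only and is not the paper's evaluation method: it assumes tokenization errors are independent across tokens, and the function name and the sentence lengths are hypothetical choices. The only figure taken from the abstract is the 99.34% tokenization accuracy.

```python
def sentence_error_rate(token_accuracy: float, sentence_length: int) -> float:
    """Probability that a sentence contains at least one tokenization error.

    Assumption (not from the paper): errors are independent across tokens,
    so a sentence of n tokens is fully correct with probability p ** n.
    """
    if not 0.0 <= token_accuracy <= 1.0:
        raise ValueError("token_accuracy must be between 0 and 1")
    if sentence_length < 0:
        raise ValueError("sentence_length must be non-negative")
    return 1.0 - token_accuracy ** sentence_length


if __name__ == "__main__":
    accuracy = 0.9934  # tokenization accuracy reported in the abstract
    # Hypothetical sentence lengths, chosen only for illustration.
    for length in (10, 20, 30, 40):
        rate = sentence_error_rate(accuracy, length)
        print(f"{length:>2} tokens: {rate:.1%} of sentences have an error")
```

Under this simplifying assumption, the share of sentences affected grows quickly with sentence length. A single segmentation error also changes the number of tokens in the sentence, so the dependencies around it can no longer line up with the gold tree, which is consistent with the abstract's observation that tokenization errors weigh more heavily on parsing than errors in the other modules.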