Morphological features for parsing morphologically-rich languages: a case of Arabic

  • Authors:
  • Jon Dehdari; Lamia Tounsi; Josef van Genabith

  • Affiliations:
  • The Ohio State University; Dublin City University; Dublin City University

  • Venue:
  • SPMRL '11 Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages
  • Year:
  • 2011

Abstract

We investigate how morphological features, in the form of part-of-speech tags, affect parsing performance, using Arabic as our test case. The large, fine-grained tagset of the Penn Arabic Treebank (498 tags) is difficult for parsers to handle, ultimately because of data sparsity. However, ad hoc conflation of treebank tags runs the risk of discarding potentially useful parsing information. The main contribution of this paper is to describe several automated, language-independent methods that search for the optimal feature combination to help parsing. We first identify 15 individual features from the Penn Arabic Treebank tagset. Including or excluding each of these features yields 2^15 = 32,768 combinations, so we then apply heuristic techniques to identify the combination achieving the highest parsing performance. Our results show a statistically significant improvement of 2.86% for vocalized text and 1.88% for unvocalized text, compared with the baseline provided by the Bikel-Bies Arabic POS mapping (and an improvement of 2.14% for vocalized text and 1.65% for unvocalized text when using product models), giving state-of-the-art results for Arabic constituency parsing.
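
To make the size of the search space concrete, the following is a minimal sketch of one possible heuristic, greedy forward selection, over 15 binary features. It is an illustration only, not the authors' method: the abstract does not say which heuristics were used. The feature names `f0` to `f14`, the function names, and `toy_score` are all hypothetical, and `toy_score` merely stands in for training and evaluating a parser on a given feature combination.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def count_combinations(num_features: int) -> int:
    """Each feature is either included or excluded, giving 2**n subsets."""
    return 2 ** num_features


def greedy_forward_selection(
    features: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Repeatedly add the single feature that most improves the score.

    Stops when no remaining feature improves on the current best score.
    This is a generic heuristic, assumed here for illustration.
    """
    all_features = set(features)
    selected: FrozenSet[str] = frozenset()
    best = score(selected)
    while True:
        best_candidate = None
        # Sorted iteration keeps tie-breaking deterministic.
        for feature in sorted(all_features - selected):
            candidate_score = score(selected | {feature})
            if candidate_score > best:
                best = candidate_score
                best_candidate = feature
        if best_candidate is None:
            return selected, best
        selected = selected | {best_candidate}


# Hypothetical placeholder features; the paper's 15 features are not listed here.
FEATURES = [f"f{i}" for i in range(15)]
HELPFUL = frozenset({"f1", "f4", "f7"})  # arbitrary choice for the toy score


def toy_score(subset: FrozenSet[str]) -> float:
    """Toy stand-in for parser accuracy: reward helpful features, penalize others."""
    return len(subset & HELPFUL) - 0.5 * len(subset - HELPFUL)


if __name__ == "__main__":
    print(count_combinations(len(FEATURES)))
    print(greedy_forward_selection(FEATURES, toy_score))
```

A greedy search of this kind evaluates at most a few dozen combinations per run rather than all 32,768, at the cost of possibly missing the global optimum when features interact.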