Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging

  • Authors: Seth Kulick

  • Affiliations: University of Pennsylvania

  • Venue: ACM Transactions on Asian Language Information Processing (TALIP)
  • Year: 2011

Abstract

Research on the problem of morphological disambiguation of Arabic has noted that techniques developed for lexical disambiguation in English do not easily transfer over, since the affixation present in Arabic creates a very different tag set than for English, encoding both inflectional morphology and more complex tokenization sequences. This work takes a new approach to this problem based on a distinction between the open-class and closed-class categories of tokens, which differ both in their frequencies and in their possible morphological affixations. This separation simplifies the morphological analysis problem considerably, making it possible to use a Conditional Random Field model for joint tokenization and “core” part-of-speech tagging of the open-class items, while the closed-class items are handled by regular expressions. This work is therefore situated between data-driven approaches and those that use a morphological analyzer. For the tasks of tokenization and core part-of-speech tagging, the resulting system outperforms, on the given test set, a system that incorporates a morphological analyzer. We also evaluate the effects of the differences on parser performance when the tagger output is used for parser input.
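The open-class/closed-class split described in the abstract can be illustrated with a minimal sketch. This is not the paper's system. The lexicon, the regular expression, the tag labels, and the names `CLOSED_CLASS`, `CLOSED_RE`, `analyze`, and `placeholder_tagger` are all assumptions made for illustration, with words written in Buckwalter transliteration. Closed-class words are tokenized and tagged by a regular expression, and everything else is passed to an open-class tagger, which here is only a stand-in for the Conditional Random Field model.

```python
import re

# Hypothetical closed-class lexicon (Buckwalter transliteration), illustrative only:
# fy = "in", mn = "from", ElY = "on".
CLOSED_CLASS = {"fy": "PREP", "mn": "PREP", "ElY": "PREP"}

# An optional conjunction prefix (w- or f-) followed by a closed-class stem.
CLOSED_RE = re.compile(
    r"^(?P<conj>[wf])?(?P<stem>" + "|".join(map(re.escape, CLOSED_CLASS)) + r")$"
)


def analyze(word, open_class_tagger):
    """Return a list of (token, tag) pairs for one whitespace-delimited word."""
    match = CLOSED_RE.match(word)
    if match is None:
        # Not a closed-class word: defer to the open-class model.
        return open_class_tagger(word)
    tokens = []
    if match.group("conj"):
        tokens.append((match.group("conj"), "CONJ"))
    stem = match.group("stem")
    tokens.append((stem, CLOSED_CLASS[stem]))
    return tokens


def placeholder_tagger(word):
    """Stand-in for a trained open-class model; it does no real analysis."""
    return [(word, "OPEN")]


if __name__ == "__main__":
    for w in ["wmn", "fy", "ktAb"]:
        print(w, analyze(w, placeholder_tagger))
```

In this sketch the regular expression both tokenizes (splitting off the conjunction) and tags the closed-class word, so the statistical model only ever sees open-class input, which is the simplification the abstract points to.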