Automatic tagging of Arabic text: from raw text to base phrase chunks

  • Authors:
  • Mona Diab;Kadri Hacioglu;Daniel Jurafsky

  • Affiliations:
  • Stanford University;University of Colorado, Boulder;Stanford University

  • Venue:
  • HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an Fβ=1 score of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an Fβ=1 score of 92.08.