Developing a competitive HMM Arabic POS tagger using small training corpora

  • Authors:
  • Mohammed Albared;Nazlia Omar;Mohd. Juzaiddin Ab Aziz

  • Affiliations:
  • University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science;University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science;University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science

  • Venue:
  • ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
  • Year:
  • 2011

Abstract

Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. POS tagging is one of the important processing steps for many natural language systems, such as information extraction and question answering. This paper presents a study aiming to find the appropriate strategy for developing a fast and accurate Arabic statistical POS tagger when only a limited amount of training material is available. This is an essential factor when dealing with languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model is defined to handle unknown-word POS guessing, based on the linear interpolation of the word suffix probability and the word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus includes about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second corpus is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
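
The abstract does not give the exact form of the unknown-word lexical model, so the following is only a minimal sketch of the general idea of linearly interpolating a suffix-based and a prefix-based tag probability. The function names, the fixed affix length `k`, the weight `lam`, the relative-frequency estimates, and the transliterated toy data are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict

def affix_tag_distributions(tagged_words, k=2):
    """Count how often each tag occurs with each k-character suffix and prefix."""
    suffix_counts = defaultdict(Counter)
    prefix_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        suffix_counts[word[-k:]][tag] += 1
        prefix_counts[word[:k]][tag] += 1
    return suffix_counts, prefix_counts

def relative_freq(counter, tag):
    """Relative-frequency estimate of P(tag | affix); 0.0 if the affix is unseen."""
    total = sum(counter.values())
    return counter[tag] / total if total else 0.0

def guess_tag_probs(word, tags, suffix_counts, prefix_counts, lam=0.5, k=2):
    """Assumed form: P(tag | word) = lam * P(tag | suffix) + (1 - lam) * P(tag | prefix)."""
    suf = suffix_counts.get(word[-k:], Counter())
    pre = prefix_counts.get(word[:k], Counter())
    return {
        t: lam * relative_freq(suf, t) + (1 - lam) * relative_freq(pre, t)
        for t in tags
    }

# Hypothetical toy training data (transliterated, with made-up coarse tags).
TOY = [
    ("alkitab", "N"),
    ("alwalad", "N"),
    ("yaktubu", "V"),
    ("yadrusu", "V"),
    ("katabat", "V"),
]

suffix_counts, prefix_counts = affix_tag_distributions(TOY, k=2)

# "alqalam" is not in the training data: its prefix "al" was seen only with N,
# and its suffix "am" was never seen, so only the prefix term contributes.
probs = guess_tag_probs("alqalam", ["N", "V"], suffix_counts, prefix_counts)
```

In an HMM tagger, scores of this kind would stand in for the lexical probabilities of words that never occurred in training. In practice the weight `lam` would be tuned on held-out data rather than fixed at 0.5.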