Developing a competitive HMM Arabic POS tagger using small training corpora

Authors:
Mohammed Albared;Nazlia Omar;Mohd. Juzaiddin Ab Aziz
Affiliations:
University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science;University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science;University Kebangsaan Malaysia, Faculty of Information Science and Technology, Department of Computer Science
Venue:
ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
Year:
2011

Abstract

Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. POS tagging is an important processing step for many natural language systems, such as information extraction and question answering. This paper presents a study that aims to find an appropriate strategy for developing a fast and accurate Arabic statistical POS tagger when only a limited amount of training material is available. This is essential for languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model is defined to guess the POS of unknown words, based on the linear interpolation of word suffix probability and word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus includes about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second corpus is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
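
The unknown-word idea mentioned in the abstract, linear interpolation of a suffix-based and a prefix-based tag probability, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names, the fixed affix length, the interpolation weight `lam`, the unsmoothed relative-frequency estimates, and the transliterated toy corpus. The abstract gives none of these details.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (transliterated word, tag) pairs, for illustration only.
TOY_CORPUS = [
    ("alkitab", "NOUN"),
    ("alqalam", "NOUN"),
    ("albayt", "NOUN"),
    ("yaktubu", "VERB"),
    ("yadrusu", "VERB"),
]


def train_affix_model(tagged_words, affix_len=2):
    """Count how often each tag occurs with each word suffix and prefix."""
    suffix_counts = defaultdict(Counter)
    prefix_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        suffix_counts[word[-affix_len:]][tag] += 1
        prefix_counts[word[:affix_len]][tag] += 1
    return suffix_counts, prefix_counts


def rel_freq(counts, affix, tag):
    """Relative-frequency estimate of P(tag | affix); 0.0 for an unseen affix."""
    tag_counts = counts.get(affix)
    if not tag_counts:
        return 0.0
    return tag_counts[tag] / sum(tag_counts.values())


def guess_tag_probs(word, suffix_counts, prefix_counts, tags, lam=0.5, affix_len=2):
    """Interpolate suffix and prefix estimates:
    P(tag | word) ~ lam * P(tag | suffix) + (1 - lam) * P(tag | prefix).
    The weight lam is an assumed value, not one reported in the source.
    """
    suffix = word[-affix_len:]
    prefix = word[:affix_len]
    return {
        tag: lam * rel_freq(suffix_counts, suffix, tag)
        + (1 - lam) * rel_freq(prefix_counts, prefix, tag)
        for tag in tags
    }


if __name__ == "__main__":
    suffix_counts, prefix_counts = train_affix_model(TOY_CORPUS)
    tags = sorted({tag for _, tag in TOY_CORPUS})
    # "almaktab" does not occur in the toy corpus, so it is treated as unknown.
    print(guess_tag_probs("almaktab", suffix_counts, prefix_counts, tags))
```

In a full HMM tagger these interpolated estimates would stand in for the lexical (emission) probabilities of words unseen in training, typically after conversion with Bayes' rule and some smoothing. Both steps are omitted here to keep the sketch short.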