Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. POS tagging is an important processing step in many natural language systems, such as information extraction and question answering. This paper presents a study that aims to identify an appropriate strategy for developing a fast and accurate Arabic statistical POS tagger when only a limited amount of training material is available. This is an essential consideration for languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model is defined to guess the POS of unknown words, based on the linear interpolation of word suffix probability and word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus contains about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second corpus is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
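The abstract does not give the formula for the unknown-word lexical model, so the following is only a minimal sketch of the general idea: estimate a tag distribution from a word's suffix and another from its prefix, then combine them by linear interpolation. The function names, the fixed affix length of 2, the interpolation weight `lam`, the uniform fallback for unseen affixes, and the transliterated toy data are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter, defaultdict

def affix_tag_counts(tagged_words, affix_len, use_suffix):
    """Count how often each tag occurs with each prefix or suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if len(word) < affix_len:
            continue  # word too short to have an affix of this length
        affix = word[-affix_len:] if use_suffix else word[:affix_len]
        counts[affix][tag] += 1
    return counts

def tag_distribution(counts, affix, tags):
    """Relative-frequency estimate of P(tag | affix).

    Assumption: fall back to a uniform distribution for unseen affixes.
    """
    seen = counts.get(affix, Counter())
    total = sum(seen.values())
    if total == 0:
        return {t: 1.0 / len(tags) for t in tags}
    return {t: seen[t] / total for t in tags}

def guess_tag_probs(word, suffix_counts, prefix_counts, tags,
                    affix_len=2, lam=0.5):
    """Linearly interpolate suffix-based and prefix-based tag estimates.

    lam weights the suffix estimate and (1 - lam) the prefix estimate;
    both lam and affix_len are hypothetical settings.
    """
    p_suf = tag_distribution(suffix_counts, word[-affix_len:], tags)
    p_pre = tag_distribution(prefix_counts, word[:affix_len], tags)
    return {t: lam * p_suf[t] + (1.0 - lam) * p_pre[t] for t in tags}

# Toy transliterated training data (illustrative only, not from either corpus).
train = [("katabat", "V"), ("darasat", "V"),
         ("almadrasa", "N"), ("alkitab", "N")]
tags = ["N", "V"]
suffix_counts = affix_tag_counts(train, 2, use_suffix=True)
prefix_counts = affix_tag_counts(train, 2, use_suffix=False)

# "albayt" is unseen: its suffix "yt" is unknown, its prefix "al" suggests a noun.
probs = guess_tag_probs("albayt", suffix_counts, prefix_counts, tags)
best_tag = max(probs, key=probs.get)
print(best_tag, probs)
```

In an HMM tagger, a distribution like this would stand in for the lexical (emission) estimate of words absent from the training lexicon. A real system would typically tune the weight and the affix lengths on held-out data rather than fix them as done here.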