Investigating the best configuration of HMM Spanish PoS tagger when minimum amount of training data is available

Authors:
Sergio Ferrández;Jesús Peral
Affiliations:
Grupo de Investigación en Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, University of Alicante, Spain;Grupo de Investigación en Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, University of Alicante, Spain
Venue:
NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Year:
2005

Citing 3

Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics.
Tagging English text with a probabilistic model. Computational Linguistics.
TnT: a statistical part-of-speech tagger. ANLC '00 Proceedings of the sixth conference on Applied natural language processing.

Cited 3

Combining automatic acquisition of knowledge with machine learning approaches for multilingual temporal recognition and normalization. Information Sciences: an International Journal.
Model-driven restricted-domain adaptation of question answering systems for business intelligence. Proceedings of the 2nd International Workshop on Business intelligencE and the WEB.
Developing a competitive HMM Arabic POS tagger using small training corpora. ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I.

Abstract

Part-of-speech (PoS) tagging is an important processing step in many natural language systems (information extraction, question answering, etc.), and a number of different approaches have been proposed for it. In this paper we study the behaviour of a Hidden Markov Model (HMM) Spanish PoS tagger trained on a minimal amount of data. Our PoS tagger is based on an HMM whose states are tag pairs that emit words, and it relies on transition and lexical probabilities. This technique was suggested by Rabiner [11], and our implementation is influenced by Brants [2]. We investigated the best HMM configuration using a small training corpus of about 50,000 words; the maximum precision obtained on an unseen Spanish text was 95.36%.
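
To make the idea of states as tag pairs concrete, the following is a minimal sketch of a second-order HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation: the function names (`train`, `viterbi`), the add-one smoothing, the tag labels, and the three-sentence toy corpus are all assumptions chosen for illustration, and the abstract does not say how the paper smooths probabilities or handles unknown words.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical padding tag marking the sentence start


def train(tagged_sentences):
    """Count tag trigrams (transitions) and tag/word pairs (lexical)."""
    tri, ctx, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [START, START] + [t for _, t in sent]
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            ctx[(tags[i - 2], tags[i - 1])] += 1
    return tri, ctx, emit, tag_count


def log_prob(num, den, n_outcomes):
    # Add-one smoothing: an assumption of this sketch, not taken from the paper.
    return math.log((num + 1) / (den + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence; each state is a tag pair."""
    tri, ctx, emit, tag_count = model
    tags = list(tag_count)
    vocab = len({w for _, w in emit}) + 1  # +1 reserves mass for unknown words
    best = {(START, START): 0.0}           # state -> best log score so far
    back = []                              # one back-pointer table per word
    for word in words:
        new_best, pointers = {}, {}
        for (t1, t2), score in best.items():
            for t3 in tags:
                s = (score
                     + log_prob(tri[(t1, t2, t3)], ctx[(t1, t2)], len(tags))
                     + log_prob(emit[(t3, word)], tag_count[t3], vocab))
                if (t2, t3) not in new_best or s > new_best[(t2, t3)]:
                    new_best[(t2, t3)] = s
                    pointers[(t2, t3)] = (t1, t2)
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(best, key=best.get)
    result = []
    for pointers in reversed(back):
        result.append(state[1])
        state = pointers[state]
    return result[::-1]


# Toy corpus (invented for this example).
corpus = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("gata", "NOUN"), ("duerme", "VERB")],
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
]
model = train(corpus)
print(viterbi(["la", "perro", "duerme"], model))
```

Working in log space avoids numeric underflow on long sentences, and keeping the previous two tags in the state is what lets the transition term condition on a tag trigram. A tagger trained on a small corpus would likely need a more careful smoothing scheme and a dedicated unknown-word model than the add-one estimate used here.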