Variable-length Markov models and ambiguous words in Portuguese

  • Authors:
  • Fabio Natanael Kepler; Marcelo Finger

  • Affiliations:
  • University of Sao Paulo, Sao Paulo, SP, Brazil; University of Sao Paulo, Sao Paulo, SP, Brazil

  • Venue:
  • YIWCALA '10 Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas
  • Year:
  • 2010

Abstract

Variable-Length Markov Chains (VLMCs) offer a way of modeling contexts longer than trigrams without suffering from data sparsity and state space complexity. However, in Historical Portuguese, two words show a high degree of ambiguity: que and a. Errors in tagging these two words account for a quarter of all errors made by a VLMC-based tagger. Moreover, these words seem to show two different types of ambiguity: one depending on non-local context and another on right context. We searched for ways of extending the VLMC-based tagger with a number of different models and methods in order to tackle these issues. The methods showed varying degrees of success, with one particular method resolving much of the ambiguity of a. We explore the reasons why this happened, and why everything we tried fails to improve the precision of que.
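
To make the idea of variable-length contexts concrete, the following is a minimal sketch and not the authors' tagger. It assumes that a context is kept only if it was seen at least `min_count` times in training, which is a simplification of the statistical pruning criteria normally used for VLMCs, and it looks only at left context. All function names, parameters, and tag labels are hypothetical.

```python
from collections import Counter, defaultdict


def train_contexts(tag_sequences, max_order=4, min_count=2):
    """Count next-tag frequencies for every left context of length
    0..max_order, then keep only contexts seen at least min_count times.
    (Assumption: count-based pruning stands in for a real VLMC criterion.)"""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            for k in range(0, min(max_order, i) + 1):
                context = tuple(tags[i - k:i])  # the k tags preceding position i
                counts[context][tag] += 1
    # The empty context is always kept so that lookup can fall back to it.
    return {ctx: c for ctx, c in counts.items()
            if ctx == () or sum(c.values()) >= min_count}


def longest_context(model, history, max_order=4):
    """Return the longest suffix of history that survived pruning."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in model:
            return context
    return ()


def next_tag_distribution(model, history, max_order=4):
    """Relative frequencies of the next tag given the longest known context."""
    counter = model[longest_context(model, history, max_order)]
    total = sum(counter.values())
    return {tag: n / total for tag, n in counter.items()}


# Toy tag sequences (hypothetical labels, not data from the paper).
sequences = [["DET", "N", "V"], ["DET", "N", "V"], ["PRO", "N", "ADV"]]
model = train_contexts(sequences, max_order=2, min_count=2)
```

In this toy model the history `["DET", "N"]` is matched by a length-2 context, while `["PRO", "N"]` falls back to the length-1 context `("N",)` because the longer one was too rare to keep. The context length therefore varies with how much evidence the training data provides, rather than being fixed as in a trigram model.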