N-th order Ergodic Multigram HMM for modeling of languages without marked word boundaries

  • Authors:
  • Hubert Hin-Cheung Law;Chorkin Chan

  • Affiliations:
  • The University of Hong Kong;The University of Hong Kong

  • Venue:
  • COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
  • Year:
  • 1996

Abstract

Ergodic HMMs have been successfully used for modeling sentence production. However, for some Asian languages such as Chinese, a word can consist of multiple characters, and there are no boundary markers between adjacent words in a sentence. This makes word segmentation of the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the N-th order Ergodic Multigram HMM for language modeling of such languages. Each state of the HMM can generate a variable number of characters corresponding to one word. The model can be trained without a word-segmented or tagged corpus, and both segmentation and tagging are learned within a single model. Results of its application to a Chinese corpus are reported.
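To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of Viterbi decoding for a first-order ergodic multigram HMM: each state (word class) emits a whole word of one or more characters, so decoding a character string jointly segments and tags it. The function name, `MAX_LEN`, and the dictionary-based parameter layout are illustrative assumptions; the paper's N-th order formulation and its unsupervised (EM-style) training are not reproduced here.

```python
import math

MAX_LEN = 4  # assumed maximum word length in characters (illustrative)

def viterbi_segment_tag(chars, states, log_init, log_trans, log_emit):
    """Jointly segment and tag a character sequence.

    chars: list of characters; states: list of word classes (tags).
    log_init[s], log_trans[s_prev][s], log_emit[s][word] are log-probabilities.
    Returns the best (word, state) sequence covering the whole string.
    """
    n = len(chars)
    NEG_INF = -math.inf
    # best[i][s]: best log-prob of an analysis of chars[:i] ending in state s
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for i in range(1, n + 1):
        # try every candidate word ending at position i
        for length in range(1, min(MAX_LEN, i) + 1):
            j = i - length
            word = "".join(chars[j:i])
            for s in states:
                emit = log_emit.get(s, {}).get(word, NEG_INF)
                if emit == NEG_INF:
                    continue
                if j == 0:
                    # first word of the sentence: use the initial distribution
                    prev_best, score = None, log_init[s] + emit
                else:
                    prev_best, score = None, NEG_INF
                    for sp in states:
                        cand = best[j][sp] + log_trans[sp][s] + emit
                        if cand > score:
                            prev_best, score = sp, cand
                if score > best[i][s]:
                    best[i][s] = score
                    back[i][s] = (j, prev_best, word)

    # trace back from the best final state
    s = max(states, key=lambda t: best[n][t])
    out, i = [], n
    while i > 0:
        j, sp, word = back[i][s]
        out.append((word, s))
        i, s = j, (sp if sp is not None else s)
    return list(reversed(out))
```

Because every state can emit words of several lengths, the dynamic program ranges over both the previous state and the length of the word ending at each character position, which is what lets a single model account for segmentation and tagging at once.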