N-th order Ergodic Multigram HMM for modeling of languages without marked word boundaries

  • Authors:
  • Hubert Hin-Cheung Law;Chorkin Chan

  • Affiliations:
  • The University of Hong Kong;The University of Hong Kong

  • Venue:
  • COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
  • Year:
  • 1996

Abstract

Ergodic HMMs have been successfully used for modeling sentence production. However, for some Asian languages such as Chinese, a word can consist of multiple characters, and there are no boundary markers between adjacent words in a sentence. This makes word segmentation of the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the N-th order Ergodic Multigram HMM for language modeling of such languages. Each state of the HMM can generate a variable number of characters corresponding to one word. The model can be trained without a word-segmented or tagged corpus, and both segmentation and tagging are learned within a single model. Results of its application to a Chinese corpus are reported.
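To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of Viterbi decoding for a first-order ergodic multigram HMM: each state (word class) emits a whole word of one or more characters, so decoding a character string jointly segments and tags it. The function name, `MAX_LEN`, and the dictionary-based parameter layout are illustrative assumptions; the paper's N-th order formulation and its unsupervised (EM-style) training are not reproduced here.

```python
import math

MAX_LEN = 4  # assumed maximum word length in characters (illustrative)

def viterbi_segment_tag(chars, states, log_init, log_trans, log_emit):
    """Jointly segment and tag a character sequence.

    chars: list of characters; states: list of word classes (tags).
    log_init[s], log_trans[s_prev][s], log_emit[s][word] are log-probabilities.
    Returns the best (word, state) sequence covering the whole string.
    """
    n = len(chars)
    NEG_INF = -math.inf
    # best[i][s]: best log-prob of an analysis of chars[:i] ending in state s
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for i in range(1, n + 1):
        # try every candidate word ending at position i
        for length in range(1, min(MAX_LEN, i) + 1):
            j = i - length
            word = "".join(chars[j:i])
            for s in states:
                emit = log_emit.get(s, {}).get(word, NEG_INF)
                if emit == NEG_INF:
                    continue
                if j == 0:
                    # first word of the sentence: use the initial distribution
                    prev_best, score = None, log_init[s] + emit
                else:
                    prev_best, score = None, NEG_INF
                    for sp in states:
                        cand = best[j][sp] + log_trans[sp][s] + emit
                        if cand > score:
                            prev_best, score = sp, cand
                if score > best[i][s]:
                    best[i][s] = score
                    back[i][s] = (j, prev_best, word)

    # trace back from the best final state
    s = max(states, key=lambda t: best[n][t])
    out, i = [], n
    while i > 0:
        j, sp, word = back[i][s]
        out.append((word, s))
        i, s = j, (sp if sp is not None else s)
    return list(reversed(out))
```

Because every state can emit words of several lengths, the dynamic program ranges over both the previous state and the length of the word ending at each character position, which is what lets a single model account for segmentation and tagging at once.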