A Hybrid Approach to Word Segmentation of Vietnamese Texts

Authors:
Lê Hồng Phương; Nguyễn Thị Minh Huyền; Azim Roussanaly; Hồ Tường Vinh
Affiliations:
LORIA, Nancy, France; Vietnam National University, Hanoi, Vietnam; LORIA, Nancy, France; IFI, Hanoi, Vietnam
Venue:
Language and Automata Theory and Applications
Year:
2008

Abstract

We present a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata, regular-expression parsing, and a maximal-matching strategy, augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
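
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the vnTokenizer implementation, and it rests on several assumptions. The lexicon is a plain set rather than a minimal finite-state automaton. The regular-expression pre-pass is skipped, so the input is taken to be a single lexical phrase. The maximal-matching step is read as keeping the segmentations with the fewest words. The smoothing is a fixed-weight interpolation of bigram and add-one unigram estimates. All names (`build_graph`, `candidates`, `log_prob`, `segment`) and all counts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy lexicon and corpus counts, invented for illustration only.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}
UNIGRAMS = Counter({"học_sinh": 5, "học": 4, "sinh_học": 3})
BIGRAMS = Counter({("học_sinh", "học"): 3, ("học", "sinh_học"): 2})


def build_graph(syllables, lexicon, max_len=4):
    """Linear graph: edges[i] lists every j such that syllables[i:j] is a word."""
    n = len(syllables)
    edges = []
    for i in range(n):
        ends = [i + 1]  # a single syllable is always allowed as a fallback
        for j in range(i + 2, min(n, i + max_len) + 1):
            if " ".join(syllables[i:j]) in lexicon:
                ends.append(j)
        edges.append(ends)
    return edges


def candidates(syllables, edges):
    """Enumerate the segmentations that use the fewest words (shortest paths)."""
    n = len(syllables)
    dist = [0] * (n + 1)  # dist[i] = fewest words needed to cover syllables[i:]
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j in edges[i])

    def walk(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            if dist[j] == dist[i] - 1:  # stay on a shortest path
                for rest in walk(j):
                    yield ["_".join(syllables[i:j])] + rest

    return list(walk(0))


def log_prob(words, unigrams, bigrams, lam=0.7):
    """Log-probability under an interpolated bigram model (assumed smoothing)."""
    total, vocab = sum(unigrams.values()), len(unigrams)

    def p_uni(w):
        # Add-one unigram estimate, so unseen words keep a nonzero probability.
        return (unigrams[w] + 1) / (total + vocab + 1)

    score = math.log(p_uni(words[0]))
    for prev, cur in zip(words, words[1:]):
        p_bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
        score += math.log(lam * p_bi + (1 - lam) * p_uni(cur))
    return score


def segment(phrase):
    """Pick the most probable candidate segmentation of one phrase."""
    syllables = phrase.split()
    if not syllables:
        return []
    cands = candidates(syllables, build_graph(syllables, LEXICON))
    return max(cands, key=lambda ws: log_prob(ws, UNIGRAMS, BIGRAMS))


print(segment("học sinh học sinh học"))
```

With these toy counts, the phrase has three shortest-path candidates, and the resolver selects `['học_sinh', 'học', 'sinh_học']`. A production system would replace the set lookup with automaton traversal and estimate the model from a real corpus.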