We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata, regular-expression parsing, and a maximal-matching strategy augmented with statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented as a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
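The core of the pipeline can be illustrated with a minimal sketch: enumerate the candidate segmentations of a syllable sequence over a lexicon (a stand-in for the minimal finite-state automaton and the linear-graph construction), then let an add-alpha smoothed bigram model pick the most probable one. The toy lexicon, the bigram/unigram counts, and the smoothing constant below are all illustrative assumptions, not the actual vnTokenizer resources.

```python
import math

# Toy lexicon; vnTokenizer stores the real one as a minimal acyclic automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}
MAX_WORD_SYLLABLES = 2  # longest lexicon entry, in syllables

def candidate_segmentations(syllables):
    """Enumerate every segmentation whose tokens are all in the lexicon
    (an exhaustive walk of the phrase's linear graph)."""
    if not syllables:
        return [[]]
    results = []
    for k in range(1, min(MAX_WORD_SYLLABLES, len(syllables)) + 1):
        word = " ".join(syllables[:k])
        if word in LEXICON:
            for rest in candidate_segmentations(syllables[k:]):
                results.append([word] + rest)
    return results

# Hypothetical counts; a real resolver estimates these from a corpus.
BIGRAM = {("<s>", "học sinh"): 2, ("học sinh", "học sinh"): 1}
UNIGRAM = {"<s>": 4, "học sinh": 3, "học": 1, "sinh": 1, "sinh học": 1}

def score(segmentation, alpha=0.1, vocab=len(LEXICON)):
    """Log-probability under an add-alpha smoothed bigram model."""
    logp, prev = 0.0, "<s>"
    for word in segmentation:
        num = BIGRAM.get((prev, word), 0) + alpha
        den = UNIGRAM.get(prev, 0) + alpha * vocab
        logp += math.log(num / den)
        prev = word
    return logp

# The classic ambiguous phrase "học sinh học sinh".
candidates = candidate_segmentations(["học", "sinh", "học", "sinh"])
best = max(candidates, key=score)  # ["học sinh", "học sinh"]
```

The smoothed model naturally favors the two-word reading here, since each extra word multiplies in another (mostly unseen, hence heavily discounted) bigram probability.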