Constrained Sequence Classification for Lexical Disambiguation

Authors:
Tran The Truyen;Dinh Q. Phung;Svetha Venkatesh
Affiliations:
Department of Computing, Curtin University of Technology, Western Australia, Australia 6845;Department of Computing, Curtin University of Technology, Western Australia, Australia 6845;Department of Computing, Curtin University of Technology, Western Australia, Australia 6845
Venue:
PRICAI '08 Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
Year:
2008

Abstract

This paper addresses lexical ambiguity, with a focus on a particular problem known as accent prediction: given an accentless sequence, the task is to restore the correct accents. This can be modelled as a sequence classification problem to which variants of Markov chains can be applied. Although the state space is large (roughly the size of the vocabulary), it is highly constrained once conditioned on the observed data. We investigate the application of several methods, including Powered Product-of-N-grams, the Structured Perceptron and Conditional Random Fields (CRFs). We show empirically, for the Vietnamese case, that these methods are fairly robust and efficient. Second-order CRFs achieve the best results, with about 94% term accuracy.
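The constraint mentioned in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that the candidate labels for an accentless token are exactly the vocabulary words that reduce to that token when accents are stripped, and it runs first-order Viterbi decoding over those candidates only. The helper names (`strip_accents`, `build_candidates`, `constrained_viterbi`), the toy vocabulary and the bigram scores are all hypothetical and chosen for illustration. The `score` function stands in for whichever model supplies the transition scores.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Map an accented word to its accentless form."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (tone and vowel diacritics).
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # The letter đ has no canonical decomposition, so handle it explicitly.
    return base.replace("đ", "d").replace("Đ", "D")


def build_candidates(vocabulary):
    """Group vocabulary words by their accentless form."""
    candidates = defaultdict(set)
    for word in vocabulary:
        candidates[strip_accents(word)].add(word)
    return candidates


def constrained_viterbi(tokens, candidates, score):
    """Best-scoring accented sequence, searching only the candidates
    allowed by each observed accentless token.

    score(prev, cur) returns a float; prev is None at the first position.
    """
    if not tokens:
        return []
    # Unknown tokens fall back to themselves as the single candidate.
    lattice = [sorted(candidates.get(tok) or {tok}) for tok in tokens]
    # best[i][s] = (score of the best path ending in s, backpointer)
    best = [{s: (score(None, s), None) for s in lattice[0]}]
    for i in range(1, len(lattice)):
        column = {}
        for cur in lattice[i]:
            prev = max(best[i - 1],
                       key=lambda p: best[i - 1][p][0] + score(p, cur))
            column[cur] = (best[i - 1][prev][0] + score(prev, cur), prev)
        best.append(column)
    # Follow backpointers from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(lattice) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]


# Toy data, invented for illustration only.
vocabulary = ["tôi", "tối", "đi", "học", "hộc"]
candidates = build_candidates(vocabulary)
bigram_scores = {("tôi", "đi"): 2.0, ("đi", "học"): 2.0}


def score(prev, cur):
    """Hypothetical transition score: a lookup with a default of zero."""
    return bigram_scores.get((prev, cur), 0.0)


print(constrained_viterbi("toi di hoc".split(), candidates, score))
```

Because each position considers only the few accented variants of the observed token, the cost per step depends on the number of variants rather than on the full vocabulary size, which is what makes decoding over a vocabulary-sized state space practical.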