Combining trigram and Winnow in Thai OCR error correction

Authors:
Surapant Meknavin;Boonserm Kijsirikul;Ananlada Chotimongkol;Cholwich Nuttee
Affiliations:
National Electronics and Computer Technology Center, Bangkok, Thailand;Chulalongkorn University, Thailand;Chulalongkorn University, Thailand;Chulalongkorn University, Thailand
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Year:
1998

Abstract

For languages that have no explicit word boundaries, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating erroneous words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an erroneous word and trying to correct all of them. In this paper, we propose reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. The boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate candidate correction words, we use a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.
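
As a rough illustration of two components named in the abstract, the sketch below shows (a) an edit distance whose substitution cost is lowered for character pairs an OCR engine might confuse, and (b) a Winnow-style multiplicative weight update. It is a minimal sketch under stated assumptions, not the authors' implementation: the confusion pairs and their costs, the threshold of half the feature count, the promotion factor of 2, and all function and class names are hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical confusion table: visually similar characters get a
# substitution cost below 1.0. The pairs and values are illustrative only.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("ภ", "ถ"): 0.3,
    ("ค", "ด"): 0.3,
}


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b (symmetric lookup)."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))


def weighted_edit_distance(source: str, target: str) -> float:
    """Levenshtein-style distance with confusion-aware substitution costs."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete all of source[:i]
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


class Winnow:
    """Winnow-style linear-threshold learner over binary features.

    Assumptions for this sketch: weights start at 1.0, the threshold is
    half the number of features, and the promotion factor alpha is 2.0.
    """

    def __init__(self, n_features: int, alpha: float = 2.0) -> None:
        self.weights: List[float] = [1.0] * n_features
        self.alpha = alpha
        self.threshold = n_features / 2.0

    def predict(self, active: List[int]) -> bool:
        """Predict True if the summed weight of active features meets the threshold."""
        return sum(self.weights[i] for i in active) >= self.threshold

    def update(self, active: List[int], label: bool) -> None:
        """On a mistake, multiply (promote) or divide (demote) active weights."""
        if self.predict(active) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for i in active:
            self.weights[i] *= factor
```

In a pipeline like the one the abstract outlines, a distance of this kind would rank candidate corrections for a dubious area, and a learner of this kind would score candidates from context features. How those scores are combined with the part-of-speech trigram model is not specified in the abstract and is not modelled here.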