An unsupervised and data-driven approach for spell checking in Vietnamese OCR-scanned texts

Authors: Cong Duy Vu Hoang; Ai Ti Aw
Affiliations: Institute for Infocomm Research (I2R), Singapore (both authors)
Venue: HYBRID '12 Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data
Year: 2012

Abstract

OCR (Optical Character Recognition) scanners do not always recognize text documents with 100% accuracy, and the resulting spelling errors make the texts hard to process further. This paper investigates the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis of the characteristics of the spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach that combines the error detection and error correction phases within a single scheme. The scheme is designed in an unsupervised and data-driven manner, making it suitable for resource-poor languages. Evaluated on a real dataset in Vietnamese, our approach gives acceptable performance (detection accuracy 86%, correction accuracy 71%). In addition, we analyze the results to show how accurate our approach can be.
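The abstract does not spell out the internals of the scheme, so the following is only a minimal sketch of what a combined detection-and-correction pass could look like in an unsupervised, data-driven setting. It is not the authors' method. It assumes a vocabulary learned from raw reference text (no labeled errors), flags tokens missing from that vocabulary, and replaces each flagged token with the closest vocabulary entry by edit distance. All names (`build_vocab`, `detect`, `correct`, `spell_check`) and the thresholds `min_count` and `max_distance` are hypothetical.

```python
import unicodedata
from collections import Counter


def nfc(token):
    """Normalize to NFC so each Vietnamese letter with diacritics is one code point."""
    return unicodedata.normalize("NFC", token)


def build_vocab(corpus_tokens, min_count=2):
    """Learn a frequency vocabulary from raw text; rare tokens are treated as noise."""
    counts = Counter(nfc(t) for t in corpus_tokens)
    return {tok: c for tok, c in counts.items() if c >= min_count}


def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def detect(tokens, vocab):
    """Detection phase: return positions of tokens not found in the vocabulary."""
    return [i for i, tok in enumerate(tokens) if nfc(tok) not in vocab]


def correct(token, vocab, max_distance=2):
    """Correction phase: closest vocabulary entry, ties broken by higher frequency."""
    token = nfc(token)
    best = None
    for cand, freq in vocab.items():
        d = edit_distance(token, cand)
        if d <= max_distance:
            key = (d, -freq, cand)
            if best is None or key < best:
                best = key
    return best[2] if best else token


def spell_check(tokens, vocab):
    """Run detection, then correct only the flagged positions."""
    flagged = set(detect(tokens, vocab))
    return [correct(t, vocab) if i in flagged else nfc(t) for i, t in enumerate(tokens)]


if __name__ == "__main__":
    corpus = "tôi học tiếng việt tôi học tiếng việt".split()
    vocab = build_vocab(corpus)
    print(spell_check(["tôi", "hoc", "tiếng", "viêt"], vocab))
```

A vocabulary lookup of this kind only catches non-word errors. Real-word errors, where the OCR output is a valid but wrong word, would need context such as an n-gram language model, which this sketch leaves out.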