An unsupervised and data-driven approach for spell checking in Vietnamese OCR-scanned texts

  • Authors: Cong Duy Vu Hoang; Ai Ti Aw

  • Affiliations: Institute for Infocomm Research (I2R), Singapore (both authors)

  • Venue: HYBRID '12 Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data
  • Year: 2012

Abstract

OCR (Optical Character Recognition) scanners do not always recognize text documents with 100% accuracy, and the resulting spelling errors make the texts hard to process further. This paper investigates spell checking for OCR-scanned text documents. First, we conduct a detailed analysis of the characteristics of spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach that combines the error detection and correction phases within a single scheme. The scheme is unsupervised and data-driven, which makes it suitable for resource-poor languages. Evaluated on a real Vietnamese dataset, our approach achieves acceptable performance (detection accuracy 86%, correction accuracy 71%). We also analyze the results to show how accurate the approach can be.
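
The abstract does not spell out the scheme itself, so the following is only a minimal sketch of what a two-phase, unsupervised, data-driven spell checker can look like, not the authors' method. It assumes that word frequencies from a clean corpus are the only resource: a token is flagged (detection) when it is unseen in the corpus, and it is replaced (correction) by the most frequent corpus word within one character edit. The toy corpus, the names `is_suspect`, `edits1`, `correct`, and `spell_check`, and the unaccented ASCII alphabet are all illustrative assumptions; real Vietnamese text would need an alphabet that covers diacritics.

```python
from collections import Counter

# Hypothetical toy corpus standing in for a large collection of clean text.
CORPUS = "toi di hoc toi di lam ban di hoc ban di choi".split()
COUNTS = Counter(CORPUS)

# Simplifying assumption: unaccented ASCII letters only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def is_suspect(token, min_count=1):
    """Detection phase: flag tokens seen fewer than min_count times."""
    return COUNTS[token] < min_count


def edits1(token):
    """All strings one deletion, substitution, or insertion away from token."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)


def correct(token):
    """Correction phase: pick the most frequent known word one edit away."""
    if not is_suspect(token):
        return token
    candidates = [c for c in edits1(token) if c in COUNTS]
    if not candidates:
        return token  # nothing plausible found; leave the token unchanged
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda c: (COUNTS[c], c))


def spell_check(text):
    """Run detection and correction over whitespace-separated tokens."""
    return " ".join(correct(t) for t in text.split())
```

For example, `spell_check("toi di hqc")` repairs the OCR-style substitution in the last token using nothing but corpus counts. A practical system would likely add context (for instance n-gram scores) and an OCR-specific confusion model, which this sketch deliberately leaves out.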