A post-processing scheme for malayalam using statistical sub-character language models
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Hi-index | 0.00 |
The approach in this paper focuses on the contextual postprocessing by selecting the most feasible word from multiple output strings of an OCR system. The correction is applied only when the selection fails. The selected word is confirmed by the collocation between the word and the adjacent words. The five functions applied in the system are (1) to select a word from candidate words, (2) to correct candidate words using a confusion matrix of syllables, (3) to combine two substrings to a word that spans two lines, (4) to guess unknown nouns, and (5) to confirm a selected word by the contextual information of adjacent words. To improve speed, we use syllable di-grams and viable-prefixes of Korean words. The experimental result shows that the two heuristics speed up the system more than 1,000 times in worst case. Our system improves the word recognition rate of the OCR system from 90.50% to 94.72%.