Page Layout Analyser for Multilingual Indian Documents

  • Authors:
  • A. Ray Chaudhuri; A. K. Mandal; B. B. Chaudhuri

  • Venue:
  • LEC '02 Proceedings of the Language Engineering Conference (LEC'02)
  • Year:
  • 2002

Abstract

An advanced Optical Character Recognition (OCR) system is equipped with a page layout analyser module. It separates textual zones from non-textual zones. It identifies textual blocks in multicolumn documents and groups them into regions that are homogeneous in terms of geometric shape and spatial distribution. All existing OCR modules developed for the various Indian scripts can handle only text-only, single-column documents. In this paper, a page layout analyser that uses typical features common to most Indian scripts is introduced. A simple compatibility criterion that allows various degrees of homogeneity is defined. The page analyser is robust in the sense that it can distinguish text regions from non-textual entities such as images, rulers, and noisy signals due to smudges and poor paper quality. Test results are shown for the two most popular Indian scripts, Devanagari (Hindi) and Bangla.
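
The abstract does not spell out the compatibility criterion, so the following is only an illustrative sketch of how a tolerance-based criterion could group text blocks into homogeneous regions. Everything in it is an assumption made for illustration and is not taken from the paper: the `Block` type, the use of mean text-line height as the feature, the `tol` and `max_gap` parameters, and the rule that blocks must overlap horizontally and lie close vertically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: bounding box plus mean text-line height."""
    x0: int
    y0: int
    x1: int
    y1: int
    line_height: float


def compatible(a: Block, b: Block, tol: float = 0.2, max_gap: int = 30) -> bool:
    """Assumed criterion: similar line height, same column, small vertical gap.

    A larger `tol` accepts a looser degree of homogeneity.
    """
    height_diff = abs(a.line_height - b.line_height) / max(a.line_height, b.line_height)
    horizontal_overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    vertical_gap = max(a.y0, b.y0) - min(a.y1, b.y1)
    return height_diff <= tol and horizontal_overlap > 0 and vertical_gap <= max_gap


def group_blocks(blocks, tol: float = 0.2, max_gap: int = 30):
    """Merge pairwise-compatible blocks into regions using union-find.

    Returns a sorted list of regions, each a list of block indices.
    """
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if compatible(blocks[i], blocks[j], tol, max_gap):
                parent[find(j)] = find(i)

    regions = {}
    for i in range(len(blocks)):
        regions.setdefault(find(i), []).append(i)
    return sorted(regions.values())


if __name__ == "__main__":
    page = [
        Block(0, 0, 100, 50, 10.0),     # paragraph, left column
        Block(0, 60, 100, 110, 11.0),   # paragraph, left column
        Block(150, 0, 250, 50, 10.0),   # paragraph, right column
        Block(0, 120, 100, 200, 20.0),  # larger type, left column
    ]
    print(group_blocks(page, tol=0.2))
    print(group_blocks(page, tol=0.5))
```

With the strict tolerance, the larger-type block stays in its own region; with the looser tolerance it merges with the left-column paragraphs, which is one way to read "various degrees of homogeneity". The right-column block is never merged with the left column because the boxes do not overlap horizontally.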