Compression of scan-digitized Indian language printed text: a soft pattern matching technique

  • Authors:
  • U. Garain; S. Debnath; A. Mandal; B. B. Chaudhuri

  • Affiliations:
  • Indian Statistical Institute, India; Regional Engineering College, West Bengal, India; Defense Research & Development Organization, Pune, India; Indian Statistical Institute, Kolkata, India

  • Venue:
  • Proceedings of the 2003 ACM symposium on Document engineering
  • Year:
  • 2003

Abstract

This paper presents a new compression scheme for Indian Language (IL) textual document images. Because OCR technology for IL scripts is not yet mature enough, bringing these documents into the digital domain requires new techniques that achieve a high degree of compression and also support operations such as document indexing and retrieval. The proposed method is based on symbolic compression, realized with an efficient segmentation-based clustering approach. A soft pattern-matching technique is implemented using two different feature sets that cooperate with each other to build an efficient prototype library. Experiments were carried out on documents printed in Devnagari (Hindi) and Bangla, two of the most widely used scripts in the Indian subcontinent. Test results show that the proposed technique outperforms several standard methods, such as CCITT Group-4 and JBIG, that are frequently used for compression of document images.
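
The general idea of symbolic compression with a prototype library can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the two feature sets, the matching rule, or any thresholds, so everything below is an assumption made for illustration. Here the hypothetical features are a pixel-wise mismatch ratio and a row/column ink-profile difference, the hypothetical rule accepts a match only when both features agree, and the tolerances `pixel_tol` and `profile_tol` are arbitrary. Segmented symbols are assumed to be equal-sized binary bitmaps.

```python
from typing import List, Tuple

Glyph = Tuple[Tuple[int, ...], ...]  # binary bitmap: rows of 0/1 pixels


def pixel_distance(a: Glyph, b: Glyph) -> float:
    """Hypothetical feature set 1: fraction of pixels that differ."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def profile_distance(a: Glyph, b: Glyph) -> float:
    """Hypothetical feature set 2: normalized row/column ink-count difference."""
    rows = sum(abs(sum(ra) - sum(rb)) for ra, rb in zip(a, b))
    cols = sum(abs(sum(ca) - sum(cb)) for ca, cb in zip(zip(*a), zip(*b)))
    return (rows + cols) / (2 * len(a) * len(a[0]))


def soft_match(a: Glyph, b: Glyph,
               pixel_tol: float = 0.15, profile_tol: float = 0.10) -> bool:
    """Assumed rule: accept only if both feature sets agree within tolerance."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False
    return (pixel_distance(a, b) <= pixel_tol
            and profile_distance(a, b) <= profile_tol)


def build_library(glyphs: List[Glyph]) -> Tuple[List[Glyph], List[int]]:
    """Replace each glyph by the index of the first prototype it matches.

    A glyph that matches no existing prototype becomes a new prototype.
    """
    prototypes: List[Glyph] = []
    indices: List[int] = []
    for g in glyphs:
        for i, p in enumerate(prototypes):
            if soft_match(g, p):
                indices.append(i)
                break
        else:
            prototypes.append(g)
            indices.append(len(prototypes) - 1)
    return prototypes, indices


# Toy 4x4 symbols: a box, the same box with one flipped pixel, and a bar.
BOX: Glyph = ((1, 1, 1, 1), (1, 0, 0, 1), (1, 0, 0, 1), (1, 1, 1, 1))
BOX_NOISY: Glyph = ((1, 1, 1, 1), (1, 0, 0, 1), (1, 0, 0, 1), (1, 1, 1, 0))
BAR: Glyph = ((0, 1, 1, 0), (0, 1, 1, 0), (0, 1, 1, 0), (0, 1, 1, 0))

prototypes, indices = build_library([BOX, BOX_NOISY, BAR, BOX])
```

In this toy run the noisy box is absorbed by the box prototype, so four symbols are stored as two bitmaps plus a list of indices. A real symbolic coder would also store symbol positions and, for lossless output, residual differences, none of which is shown here.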