Compression of scan-digitized Indian language printed text: a soft pattern matching technique

Authors:
U. Garain;S. Debnath;A. Mandal;B. B. Chaudhuri
Affiliations:
Indian Statistical Institute, India;Regional Engineering College, West Bengal, India;Defense Research & Development Organization, Pune, India;Indian Statistical Institute, Kolkata, India
Venue:
Proceedings of the 2003 ACM symposium on Document engineering
Year:
2003

Abstract

This paper presents a new compression scheme for Indian Language (IL) textual document images. Since OCR technology for IL scripts is not yet mature, bringing these documents into the digital domain requires new techniques that achieve a high degree of compression and also support operations such as document indexing and retrieval. The proposed method is based on symbolic compression, realized through an efficient segmentation-based clustering approach. A soft pattern-matching technique is implemented using two different feature sets that cooperate with each other to build an efficient prototype library. Experiments were carried out on documents printed in Devnagari (Hindi) and Bangla, two of the most widely used scripts in the Indian subcontinent. Test results show that the proposed technique outperforms several standard methods, such as CCITT Group 4 and JBIG, that are frequently used for compression of document images.
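
The abstract does not say which two feature sets are used or how they are combined, so the following is only a minimal sketch of the general idea of symbolic compression with a prototype library, not the authors' method. It assumes the symbols have already been segmented and scaled to a common size. The two features (pixel mismatch fraction and row/column projection profiles), the rule that both must agree, the tolerance values, and all function names are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A symbol bitmap: each row is a string of '0'/'1', all rows the same width.
Bitmap = Tuple[str, ...]


def hamming_fraction(a: Bitmap, b: Bitmap) -> float:
    """Hypothetical feature 1: fraction of pixels that differ."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def profile(bm: Bitmap) -> List[int]:
    """Black-pixel counts per row, followed by counts per column."""
    rows = [r.count("1") for r in bm]
    cols = [sum(r[c] == "1" for r in bm) for c in range(len(bm[0]))]
    return rows + cols


def profile_distance(a: Bitmap, b: Bitmap) -> float:
    """Hypothetical feature 2: normalized difference of projection profiles."""
    pa, pb = profile(a), profile(b)
    denom = sum(pa) + sum(pb)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(pa, pb)) / denom


def soft_match(a: Bitmap, b: Bitmap,
               pixel_tol: float = 0.15, profile_tol: float = 0.15) -> bool:
    """Assumed rule: both features must fall within their tolerance."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False
    return (hamming_fraction(a, b) <= pixel_tol
            and profile_distance(a, b) <= profile_tol)


def compress(symbols: List[Bitmap]) -> Tuple[List[Bitmap], List[int]]:
    """Replace each symbol by the index of the first matching prototype,
    adding a new prototype when nothing in the library matches."""
    library: List[Bitmap] = []
    indices: List[int] = []
    for sym in symbols:
        for i, proto in enumerate(library):
            if soft_match(sym, proto):
                indices.append(i)
                break
        else:
            library.append(sym)
            indices.append(len(library) - 1)
    return library, indices


# Toy input: a box, the same box with one noisy pixel, and a vertical bar.
box = ("1111", "1001", "1001", "1111")
noisy_box = ("1111", "1001", "1011", "1111")
bar = ("1000", "1000", "1000", "1000")

library, indices = compress([box, noisy_box, bar, box])
print(len(library), indices)
```

In this toy run the noisy box is absorbed into the box prototype, so four symbols are stored as two prototypes plus an index sequence. A practical coder would also need to record symbol positions, and could encode residuals if exact reconstruction is required; neither is shown here.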