An Approach for Stemming in Symbolically Compressed Indian Language Imaged Documents

Authors:
Utpal Garain; Alok Kumar Datta
Affiliations:
Indian Statistical Institute, India; Indian Statistical Institute, India
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots, thereby improving overall retrieval efficiency. This paper presents an algorithm for stemming in the context of a document image retrieval system. The algorithm assumes that the documents are symbolically compressed, and stemming is attempted in the compressed domain itself. Experiments have been conducted on Indian language imaged documents, for which efficient OCR remains a challenging task. Results obtained from a set of 150 document images (in Bangla script, the second most popular script in the Indian sub-continent) consisting of about 12K words show promising performance of the proposed approach.
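
The abstract does not give the details of the algorithm, so the following is only an illustrative sketch of what stemming in a symbolically compressed domain could look like, not the authors' method. It assumes that compression has already replaced each character image with an integer prototype ID, so that a word is a tuple of such IDs, and that variant forms of a word share a leading run of symbols. The function names, the `min_stem_len` threshold, and the greedy prefix-grouping strategy are all hypothetical choices made for this example.

```python
def common_prefix_len(a, b):
    """Number of leading symbols that two symbol sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_by_stem(words, min_stem_len=3):
    """Greedily group words (tuples of prototype IDs) by shared prefix.

    Assumption for illustration: a word joins the first existing group
    whose current stem shares at least `min_stem_len` leading symbols
    with it; the group's stem is then shortened to that shared prefix.
    No OCR is involved: only symbol IDs are compared.
    """
    groups = []  # each entry is [stem, members]
    for word in sorted(set(words)):
        for group in groups:
            k = common_prefix_len(group[0], word)
            if k >= min_stem_len:
                group[0] = group[0][:k]
                group[1].append(word)
                break
        else:
            # No sufficiently similar group: the word starts its own.
            groups.append([word, [word]])
    return [(stem, members) for stem, members in groups]


# Hypothetical compressed words; the integers stand for prototype IDs.
words = [(4, 9, 2, 7), (4, 9, 2, 7, 1), (4, 9, 2, 5, 5), (8, 3, 3), (8, 3, 3, 6)]
for stem, members in group_by_stem(words):
    print(stem, members)
```

In this sketch a query word would be mapped to its stem in the same way and matched against the grouped stems, so retrieval never needs the recognized text. A real system would also have to tolerate imperfect prototype matching, which this example ignores.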