Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots, thereby improving overall retrieval efficiency. This paper presents an algorithm for stemming in the context of a document image retrieval system. The algorithm assumes that the documents are symbolically compressed, and stemming is performed in the compressed domain itself. Experiments have been conducted on imaged documents in Indian languages, for which efficient OCR remains a challenging task. Results obtained from a set of 150 document images (in Bangla script, the second most popular script in the Indian sub-continent) consisting of about 12K words show promising performance for the proposed approach.