Efficient Language-Independent Retrieval of Printed Documents without OCR

Authors:
Walid Magdy;Kareem Darwish;Motaz El-Saban
Affiliations:
School of Computing, Dublin City University, Dublin 9, Ireland;Cairo Microsoft Innovation Center, Microsoft, Abou Rawash, Egypt B115;Cairo Microsoft Innovation Center, Microsoft, Abou Rawash, Egypt B115
Venue:
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Year:
2009

Abstract

Recent book digitization initiatives have made millions of books accessible and searchable. Although OCR remains essential for retrieving printed documents, OCR engines support only a limited set of languages and are generally expensive to build. This paper proposes a language-independent approach that enables search through printed documents by combining image-based matching with conventional IR techniques, without using OCR. While image-based matching can be effective in finding similar words, complementing it with efficient retrieval techniques allows for sub-word matching, term weighting, and document ranking. The basic idea is that similar connected elements in printed documents are clustered and represented with IDs, which are then used to generate equivalent textual representations. The resultant representations are indexed using an IR engine and searched using the equivalent IDs of the connected elements in queries. Although the main benefit of the proposed approach lies in languages for which no OCR exists, the technique was tested on English and Arabic to ascertain its relative effectiveness. The approach achieves more than 61% relative effectiveness compared to using OCR for both languages. While the reported numbers are lower than those of OCR-based approaches, the proposed method is fully automated, does not require any supervised training, and allows documents to be searchable within a few hours.
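
The pipeline outlined in the abstract (extract connected elements, group similar ones under shared IDs, index the resulting ID sequences as text) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: page images are tiny grids of `#` and `.` characters, "similar" is simplified to an exact pixel-shape match instead of a real image-similarity clustering, and a plain in-memory inverted index with overlap counting stands in for an IR engine with term weighting. The names `connected_elements`, `ElementCodebook`, `to_tokens`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict, deque


def connected_elements(image):
    """Find 4-connected groups of '#' pixels in a list of equal-length strings.

    Returns (leftmost_column, shape) pairs sorted left to right, where shape
    is a frozenset of pixel offsets relative to the element's top-left corner.
    """
    rows, cols = len(image), len(image[0])
    seen, elements = set(), []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or (r, c) in seen:
                continue
            queue, pixels = deque([(r, c)]), []
            seen.add((r, c))
            while queue:  # breadth-first flood fill of one element
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            top = min(y for y, _ in pixels)
            left = min(x for _, x in pixels)
            shape = frozenset((y - top, x - left) for y, x in pixels)
            elements.append((left, shape))
    elements.sort(key=lambda e: e[0])
    return elements


class ElementCodebook:
    """Assigns one ID per distinct shape (assumption: exact match as 'cluster')."""

    def __init__(self):
        self.ids = {}

    def token(self, shape):
        if shape not in self.ids:
            self.ids[shape] = f"e{len(self.ids)}"
        return self.ids[shape]


def to_tokens(image, codebook):
    """Convert an image into its textual representation: a list of element IDs."""
    return [codebook.token(shape) for _, shape in connected_elements(image)]


def build_index(docs, codebook):
    """Build a toy inverted index mapping element ID -> set of document names."""
    index = defaultdict(set)
    for name, image in docs.items():
        for tok in to_tokens(image, codebook):
            index[tok].add(name)
    return index


def search(query_image, index, codebook):
    """Rank documents by how many distinct query element IDs they contain."""
    scores = defaultdict(int)
    for tok in set(to_tokens(query_image, codebook)):
        for name in index.get(tok, ()):
            scores[name] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    codebook = ElementCodebook()
    docs = {
        "doc_a": ["#.##.#", "#.#..#"],
        "doc_b": ["##.#", "##.#"],
    }
    index = build_index(docs, codebook)
    print(to_tokens(docs["doc_a"], codebook))
    print(search(["##.##", "#..#."], index, codebook))
```

In this sketch the query is rendered as an image and passed through the same codebook as the documents, so retrieval only ever compares ID tokens and never needs to know which characters or language the shapes belong to. A realistic system would replace the exact-match codebook with clustering over image features and hand the token sequences to a full IR engine.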