Effects of OCR errors on ranking and feedback using the vector space model
Information Processing and Management: an International Journal
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Term selection for searching printed Arabic
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval can Cope with Many Errors
Information Retrieval
Translation-Based Indexing for Cross-Language Retrieval
Proceedings of the 24th BCS-IRSG European Colloquium on IR Research: Advances in Information Retrieval
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
ECDL '97 Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries
Probabilistic structured query methods
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Digital Mountain: From Granite Archive to Global Access
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
A search engine for historical manuscript images
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Combining the language model and inference network approaches to retrieval
Information Processing and Management: an International Journal - Special issue: Bayesian networks and information retrieval
Digital Image Processing (3rd Edition)
Digital Image Processing (3rd Edition)
Font Adaptive Word Indexing of Modern Printed Documents
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pattern Recognition, Third Edition
Pattern Recognition, Third Edition
Word spotting for historical documents
International Journal on Document Analysis and Recognition
Keyword-guided word spotting in historical printed documents using synthetic data and user feedback
International Journal on Document Analysis and Recognition
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Efficient search in document image collections
ACCV'07 Proceedings of the 8th Asian conference on Computer vision - Volume Part I
Versatile search of scanned Arabic handwriting
SACH'06 Proceedings of the 2006 conference on Arabic and Chinese handwriting recognition
Hi-index | 0.00 |
Recent book digitization initiatives have facilitated the access and search of millions of books. Although OCR remains essential for retrieving printed documents, OCR engines remain limited in the languages they handle and are generally expensive to build. This paper proposes a language independent approach that enables search through printed documents in a way that combines image-based matching with conventional IR techniques without using OCR. While image-based matching can be effective in finding similar words, complementing it with efficient retrieval techniques allows for sub-word matching, term weighting, and document ranking. The basic idea is that similar connected elements in printed documents are clustered and represented with ID's, which are then used to generate equivalent textual representations. The resultant representations are indexed using an IR engine and searched using the equivalent ID's of the connected elements in queries. Though, the main benefit of the proposed approach lies in languages for which no OCR exists, the technique was tested on English and Arabic to ascertain the relative effectiveness of the approach. The approach achieves more than 61% relative effectiveness compared to using OCR for both languages. While the reported numbers are lower than that of OCR-based approaches, the proposed method is fully automated, does not require any supervised training, and allows documents to be searchable within a few hours.