We give a comprehensive report on our experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. More specifically, we show that average precision and recall are not affected by OCR errors across systems for several collections. The collections used in these experiments include both actual OCR-generated text and standard information retrieval collections corrupted through the simulation of OCR errors. Both the actual and simulation experiments include full-text and abstract-length documents. We also demonstrate that the ranking and feedback methods associated with these models are generally not robust enough to deal with OCR errors. We further show that OCR errors and garbage strings generated from the mistranslation of graphic objects increase the size of the index by a wide margin. We not only point out problems that can arise from applying OCR text within an information retrieval environment but also suggest solutions to overcome some of these problems.
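As a rough illustration of two ideas mentioned above (corrupting a clean collection through simulated OCR errors, and the resulting growth of the index), the following minimal sketch may help. It is not the procedure used in the experiments: the confusion table `CONFUSIONS`, the functions `corrupt` and `vocabulary`, the error rate, and the sample documents are all assumptions made for illustration only.

```python
import random

# Hypothetical table of character confusions (an assumption for illustration;
# not taken from the experiments described in the abstract).
CONFUSIONS = {"l": "1", "o": "0", "e": "c", "m": "rn", "i": "l"}


def corrupt(text, error_rate, seed=0):
    """Replace confusable characters with probability error_rate."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def vocabulary(docs):
    """Set of distinct whitespace-delimited terms, i.e. the index entries."""
    return {term for doc in docs for term in doc.lower().split()}


if __name__ == "__main__":
    # Toy collection (hypothetical): the same terms recur across documents.
    docs = [
        "information retrieval from optical text",
        "retrieval models for information systems",
        "optical recognition of text in information systems",
    ]
    noisy_docs = [corrupt(d, 0.2, seed=i) for i, d in enumerate(docs)]

    # Each distinct misspelling of a term becomes its own index entry,
    # which is why noisy text tends to enlarge the vocabulary.
    print("clean terms:", len(vocabulary(docs)))
    print("noisy terms:", len(vocabulary(noisy_docs)))
```

In this sketch the index is reduced to a set of distinct terms. A real retrieval system would also store postings and term weights, so the effect of misrecognized terms and garbage strings on index size would be measured differently there.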