Imaged Document Text Retrieval Without OCR

Authors:
Chew Lim Tan;Weihua Huang;Zhaohui Yu;Yi Xu
Affiliations:
National Univ. of Singapore, Kent Ridge;National Univ. of Singapore, Kent Ridge;Toronto, Ontario, Canada;Agilent Technologies Singapore Pte Ltd., Alexandra Road, Singapore
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2002

Citing 6
Cited 23

Determination of the Script and Language Content of Document Images

IEEE Transactions on Pattern Analysis and Machine Intelligence
Document image similarity and equivalence detection

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Probabilistic Retrieval of OCR Degraded Text Using N-Grams

ECDL '97 Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries
Extraction of Indicative Summary Sentences from Imaged Documents

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Duplicate Detection for Symbolically Compressed Documents

ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Content-Based Indexing and Retrieval Method of Chinese Document Images

ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition

Word Searching in Document Images Using Word Portion Matching

DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Graphics Recognition - from Re-engineering to Retrieval

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Indexing and retrieval of words in old documents

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Information Retrieval in Document Image Databases

IEEE Transactions on Knowledge and Data Engineering
Noisy Text Categorization

IEEE Transactions on Pattern Analysis and Machine Intelligence
Document Image Retrieval Based on Density Distribution Feature and Key Block Feature

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Font Adaptive Word Indexing of Modern Printed Documents

IEEE Transactions on Pattern Analysis and Machine Intelligence
Document image analysis for active reading

SADPI '07 Proceedings of the 2007 international workshop on Semantically aware document processing and indexing
Retrieval of machine-printed Latin documents through Word Shape Coding

Pattern Recognition
Text image matching without language model using a Hausdorff distance

Information Processing and Management: an International Journal
A word shape coding method for camera-based document images

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Feature string-based intelligent information retrieval from Tamil document images

International Journal of Computer Applications in Technology
Text retrieval from early printed books

Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
A novel adaptive morphological approach for degraded character image segmentation

Pattern Recognition
A Document Image Retrieval System

Engineering Applications of Artificial Intelligence
PaperComp 2010: first international workshop on paper computing

Proceedings of the 12th ACM international conference adjunct papers on Ubiquitous computing - Adjunct
A survey of keyword spotting techniques for printed document images

Artificial Intelligence Review
Keyword spotting on Korean document images by matching the keyword image

ICADL'05 Proceedings of the 8th international conference on Asian Digital Libraries: implementing strategies and sharing experiences
Efficient word retrieval by means of SOM clustering and PCA

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Exploring digital libraries with document image retrieval

ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Amharic document image retrieval using morphological coding

Proceedings of the International Conference on Management of Emergent Digital EcoSystems
Near-duplicate document image matching: A graphical perspective

Pattern Recognition

Abstract

We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects, from which two image features are extracted: the Vertical Traverse Density (VTD) and the Horizontal Traverse Density (HTD). An n-gram based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.
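
The pipeline in the abstract (traverse-density features, n-gram document vectors, dot-product similarity) can be illustrated with a minimal Python sketch. The abstract does not define the features precisely, so the sketch assumes that a traverse density counts the background-to-stroke transitions met along each scan line of a binarized character object. It also assumes that character objects have already been mapped to class labels by some feature comparison step, and that document vectors are length-normalized before the dot product. The names `traverse_density`, `htd`, `vtd`, `ngram_vector`, `similarity`, and `GLYPH` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def traverse_density(lines):
    """For each scan line, count 0 -> 1 (background -> stroke) transitions."""
    return [
        sum(1 for prev, cur in zip([0] + list(line), line)
            if prev == 0 and cur == 1)
        for line in lines
    ]


def htd(glyph):
    """Assumed HTD: stroke crossings along each row of the glyph."""
    return traverse_density(glyph)


def vtd(glyph):
    """Assumed VTD: stroke crossings along each column of the glyph."""
    return traverse_density(list(zip(*glyph)))


def ngram_vector(labels, n=3):
    """Sparse, unit-length vector of n-gram counts over class labels."""
    counts = Counter(
        tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)
    )
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse document vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


# A tiny binarized character object (1 = stroke pixel), roughly an "H".
GLYPH = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]

if __name__ == "__main__":
    print("HTD:", htd(GLYPH))
    print("VTD:", vtd(GLYPH))
    # Strings stand in for sequences of character-object class labels.
    doc_a = ngram_vector("abcabcab", n=2)
    doc_b = ngram_vector("abcabxab", n=2)
    print("similarity:", similarity(doc_a, doc_b))
```

Because the vectors are normalized in this sketch, a document compared with itself scores 1 (up to floating-point rounding) and documents sharing no n-grams score 0. An unnormalized dot product would instead favor longer documents.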