Retrieval of machine-printed Latin documents through Word Shape Coding

Authors:
Shijian Lu; Chew Lim Tan
Affiliations:
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore; Department of Computer Science, School of Computing, National University of Singapore, Singapore
Venue:
Pattern Recognition
Year:
2008

Abstract

This paper presents a technique for retrieving machine-printed Latin-based document images through word shape coding. Adopting the idea of image annotation, we propose a word shape coding scheme that converts each word image into a word shape code using a few shape features. The text content of an imaged document is thus captured by a document vector built from the word shape codes and their frequencies, and the similarity between two document images is measured by comparing their document vectors. Retrieval proceeds in two stages. Because documents in the same language share a large number of high-frequency, language-specific stop words, the first stage retrieves the documents written in the same language as the query document. The second stage then re-ranks the documents retrieved in the first stage by topic similarity. Experiments show that document images of different languages and topics can be retrieved properly with the proposed word shape coding scheme.
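
The pipeline described in the abstract (shape code per word, frequency-based document vector, language filter, topic re-ranking) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three-symbol ascender/descender code stands in for the authors' shape features, character strings stand in for word images, the most frequent codes stand in for stop words, and the names `shape_code`, `document_vector`, `split_vector`, `retrieve`, `n_frequent`, and `language_threshold` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical shape classes (NOT the paper's coding scheme).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map a word to a coarse code: 'A' ascender, 'D' descender, 'x' otherwise."""
    symbols = []
    for ch in word.lower():
        if ch in ASCENDERS:
            symbols.append("A")
        elif ch in DESCENDERS:
            symbols.append("D")
        else:
            symbols.append("x")
    return "".join(symbols)


def document_vector(words):
    """Document vector: frequency of each word shape code."""
    return Counter(shape_code(w) for w in words)


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_vector(vec, n_frequent):
    """Split a vector into its most frequent codes (a stand-in for
    stop words) and the remaining codes (a stand-in for topic words)."""
    top = Counter(dict(vec.most_common(n_frequent)))
    rest = Counter({k: c for k, c in vec.items() if k not in top})
    return top, rest


def retrieve(query_words, corpus, n_frequent=3, language_threshold=0.5):
    """Stage 1: keep documents whose frequent codes resemble the query's.
    Stage 2: rank the survivors by similarity of the remaining codes."""
    q_top, q_rest = split_vector(document_vector(query_words), n_frequent)
    ranked = []
    for name, words in corpus.items():
        d_top, d_rest = split_vector(document_vector(words), n_frequent)
        if cosine(q_top, d_top) < language_threshold:
            continue  # assumed to be a different language
        ranked.append((cosine(q_rest, d_rest), name))
    return [name for _, name in sorted(ranked, reverse=True)]
```

In an actual system the codes would be computed from word images rather than from characters, and the cutoffs used here (`n_frequent`, `language_threshold`) would need to be tuned on real data; the values above are placeholders.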