Text Retrieval from Document Images Based on Word Shape Analysis

Authors:
Chew Lim Tan;Weihua Huang;Sam Yuan Sung;Zhaohui Yu;Yi Xu
Affiliations:
School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543 (all authors)
Venue:
Applied Intelligence
Year:
2003

Abstract

In this paper, we propose a method for retrieving text from document images using a similarity measure based on word shape analysis. Instead of applying optical character recognition, we extract features directly from the images. Document images are segmented into word units, and features called vertical bar patterns are extracted from these word units by detecting local extrema points. All vertical bar patterns are then used to build document vectors. Finally, we obtain the pairwise similarity of document images as the scalar product of their document vectors. Four corpora of news articles were used to test the validity of our method. In these tests, the similarity of document images computed by this method was compared with the similarity of the ASCII versions of the same documents computed with the N-gram algorithm for text documents.
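
The vector-building and scalar-product steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The pattern codes ("p1", "p2", ...) are hypothetical placeholders for vertical bar patterns, whose actual encoding the abstract does not specify. Scaling each vector to unit length is also an assumption, made so that the scalar product falls between 0 and 1. The same two functions can be applied to character N-grams of ASCII text to obtain a comparable text-side score, again only as an illustration of the comparison.

```python
from collections import Counter
from math import sqrt


def document_vector(features):
    """Turn a list of feature codes into a unit-length frequency vector.

    The vector is stored sparsely as {feature: weight}. Unit-length
    scaling is an assumption of this sketch, not taken from the paper.
    """
    counts = Counter(features)
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {f: c / norm for f, c in counts.items()}


def scalar_product(u, v):
    """Scalar (dot) product of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def char_ngrams(text, n=3):
    """Character N-grams of a string, for the ASCII-text comparison."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical pattern codes standing in for vertical bar patterns
# extracted from the word images of three documents.
doc_a = ["p1", "p2", "p2", "p3"]
doc_b = ["p2", "p3", "p3", "p4"]
doc_c = ["p5", "p6"]

sim_ab = scalar_product(document_vector(doc_a), document_vector(doc_b))
sim_ac = scalar_product(document_vector(doc_a), document_vector(doc_c))
sim_aa = scalar_product(document_vector(doc_a), document_vector(doc_a))

# Text-side score for two ASCII strings, using character trigrams.
text_sim = scalar_product(
    document_vector(char_ngrams("text retrieval")),
    document_vector(char_ngrams("text retrieved")),
)

print(round(sim_ab, 3), round(sim_ac, 3), round(sim_aa, 3))
```

Documents that share many patterns score close to 1, and documents with no pattern in common score 0. A sparse dictionary is used here because most patterns are absent from any single document.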