Searching Off-line Arabic Documents

Authors:
Jim Chan;Celal Ziftci;David Forsyth
Affiliations:
University of Illinois, Urbana;University of Illinois, Urbana;University of Illinois, Urbana
Venue:
CVPR '06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2
Year:
2006

Abstract

Currently, an abundance of historical manuscripts, journals, and scientific notes remains largely inaccessible in library archives. Manual transcription and publication of such documents is unlikely, and automatic transcription accurate enough to support a traditional text search is difficult. In this work, we describe a lexicon-free system for performing text queries on off-line printed and handwritten Arabic documents. Our segmentation-based approach uses generalized hidden Markov models (gHMMs) with a bigram letter transition model, and kernel principal component analysis with linear discriminant analysis (KPCA/LDA) for letter discrimination. The segmentation stage is integrated with inference. We show that our method is robust to varying letter forms, ligatures, and overlaps. Additionally, we find that ignoring letters beyond the adjoining neighbors has little effect on inference and localization, which leads to a significant performance increase over standard dynamic programming. Finally, we discuss an extension to perform batch searches of large word lists for indexing purposes.
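To make "segmentation integrated with inference" and "bigram letter transition model" concrete, the following is a minimal sketch of a generic segmental (semi-Markov) Viterbi decoder that chooses segment boundaries and letter labels jointly. It is an illustration under stated assumptions, not the authors' implementation: the function `segmental_viterbi`, the toy `TEMPLATES`, the mismatch-based `seg_score`, and the uniform bigram table are all hypothetical, the observations are plain characters rather than image features, and neither the KPCA/LDA letter model nor the neighbor-limited speedup mentioned in the abstract is reproduced.

```python
import math

def segmental_viterbi(obs, letters, seg_score, log_bigram, log_start, max_len):
    """Jointly pick segment boundaries and letter labels for obs.

    best[t][a] is the best log score for explaining obs[:t] when the
    final segment ends at t and is labelled with letter a.
    """
    n = len(obs)
    best = [{a: -math.inf for a in letters} for _ in range(n + 1)]
    back = [{a: None for a in letters} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for a in letters:
            # Try every allowed length d for the final segment obs[t-d:t].
            for d in range(1, min(max_len, t) + 1):
                s = t - d
                emit = seg_score(obs[s:t], a)
                if s == 0:
                    prev, cand = None, log_start[a] + emit
                else:
                    # Bigram model: only the previous letter matters.
                    prev = max(letters, key=lambda b: best[s][b] + log_bigram[b][a])
                    cand = best[s][prev] + log_bigram[prev][a] + emit
                if cand > best[t][a]:
                    best[t][a] = cand
                    back[t][a] = (s, prev)
    # Trace back from the best final letter to recover the segments.
    last = max(letters, key=lambda a: best[n][a])
    segments, t, a = [], n, last
    while t > 0:
        s, prev = back[t][a]
        segments.append((s, t, a))
        t, a = s, prev
    segments.reverse()
    return best[n][last], segments

# Toy setup (all values are assumptions for illustration only).
TEMPLATES = {"a": "xx", "b": "yyy"}   # hypothetical per-letter column patterns
LETTERS = list(TEMPLATES)

def seg_score(segment, letter):
    """Log-score stand-in: penalize mismatched columns and length differences."""
    template = TEMPLATES[letter]
    mismatches = sum(x != y for x, y in zip(segment, template))
    mismatches += abs(len(segment) - len(template))
    return -2.0 * mismatches

LOG_START = {a: math.log(0.5) for a in LETTERS}
LOG_BIGRAM = {a: {b: math.log(0.5) for b in LETTERS} for a in LETTERS}

score, segs = segmental_viterbi("xxyyyxx", LETTERS, seg_score,
                                LOG_BIGRAM, LOG_START, max_len=3)
print(segs)
```

Each returned tuple is `(start, end, letter)` with a half-open column range. Because the segment length is searched inside the same recursion that scores letter transitions, no separate segmentation pass is needed, which is the sense in which segmentation is folded into inference in this sketch.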