Automatic keyword extraction from historical document images

Authors:
Kengo Terasawa;Takeshi Nagasaki;Toshio Kawashima
Affiliations:
School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan;School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan;School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan
Venue:
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Year:
2006

Abstract

This paper presents a method for automatically extracting keywords from historical document images. The method is language independent because it is purely appearance based: it requires neither lexical information nor statistical language models. Moreover, because it does not need word segmentation, it can be applied to Eastern languages that are written without clear spacing between words. The first half of the paper describes an algorithm that retrieves document image regions similar in appearance to a given query image. Evaluated in terms of recall and precision, the algorithm achieved an average precision of over 80–90%. The second half of the paper describes a keyword extraction method that works even when no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system could successfully extract keywords from relatively large documents.
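
The abstract reports retrieval quality as average precision. The sketch below shows one common way to compute that figure from a ranked list of retrieved regions. It is not the authors' evaluation code: the function name `average_precision` and its parameters are illustrative assumptions, and the paper may define the measure differently.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision of one ranked retrieval result.

    ranked_relevance: sequence of booleans, best-ranked result first;
        True means the retrieved region really matches the query.
    total_relevant: number of true matches in the whole collection.
        If omitted, it is assumed that every true match appears
        somewhere in ranked_relevance (hypothetical convention).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each correct result.
            precision_sum += hits / rank
    if total_relevant is None:
        total_relevant = hits
    if total_relevant == 0:
        return 0.0
    return precision_sum / total_relevant


if __name__ == "__main__":
    # Correct results at ranks 1 and 3: (1/1 + 2/3) / 2
    print(average_precision([True, False, True, False]))
```

Passing `total_relevant` explicitly penalizes true matches that the system never retrieved, which matters when the ranked list is truncated.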