Automatic keyword extraction from historical document images

  • Authors:
  • Kengo Terasawa;Takeshi Nagasaki;Toshio Kawashima

  • Affiliations:
  • School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan;School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan;School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan

  • Venue:
  • DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
  • Year:
  • 2006

Abstract

This paper presents a method for automatic keyword extraction from historical document images. The proposed method is language independent because it is purely appearance based: it requires neither lexical information nor any statistical language model. Moreover, since it does not need word segmentation, it can be applied to Eastern languages that do not put clear spacing between words. The first half of the paper describes an algorithm that retrieves document image regions similar in appearance to a given query image. The algorithm was evaluated in terms of recall and precision, and achieved an average precision of over 80–90%. The second half of the paper describes a keyword extraction method that works even when no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system could successfully extract keywords from relatively large documents.
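
The abstract reports retrieval quality as average precision but does not say how the figure was computed. As a point of reference only, the sketch below uses the common non-interpolated definition: the mean of the precision values taken at the rank of each relevant result. The function name `average_precision` and its parameters are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Sequence


def average_precision(ranked_relevance: Sequence[bool],
                      num_relevant: Optional[int] = None) -> float:
    """Non-interpolated average precision for one query.

    ranked_relevance: relevance flags of the retrieved regions, best match first.
    num_relevant: total number of relevant regions in the collection; if omitted,
        it is assumed that every relevant region appears in the ranked list.
    NOTE: hypothetical helper for illustration, not the authors' evaluation code.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of this relevant result.
            precision_sum += hits / rank
    return precision_sum / num_relevant
```

For a ranked list whose first and third results are relevant, the precisions at those ranks are 1/1 and 2/3, so `average_precision([True, False, True, False])` is their mean, 5/6. Passing `num_relevant` explicitly penalizes relevant regions that were never retrieved.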