Development of Nom character segmentation for collecting patterns from historical document pages

  • Authors:
  • Truyen Van Phan;Bilan Zhu;Masaki Nakagawa

  • Affiliations:
  • Tokyo Univ. of Agri. & Tech., Koganei, Tokyo, Japan;Tokyo Univ. of Agri. & Tech., Koganei, Tokyo, Japan;Tokyo Univ. of Agri. & Tech., Koganei, Tokyo, Japan

  • Venue:
  • Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
  • Year:
  • 2011

Abstract

In this paper, we present the first effort at preprocessing and character segmentation of digitized Nom document pages, with the goal of archiving them digitally. Nom is an ideographic script that was used to write Vietnamese from the 10th century to the 20th century. Because the pages have varied and complex layouts, we propose an efficient method based on connected component analysis to extract characters from the page images. The area Voronoi diagram is then used to represent the neighborhood and boundary of each connected component. Based on this representation, each character can be treated as a group of adjacent Voronoi regions. To improve segmentation performance, we use the recursive x-y cut method to segment separated regions. We evaluate the method on several pages with different layouts. The results confirm that the method is effective for character segmentation in Nom documents.
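
The abstract names the recursive x-y cut but gives no implementation details. As a rough illustration only, the sketch below shows a generic projection-profile version of that technique on a small binary grid (1 = ink, 0 = background). It is not the authors' implementation: the function names `xy_cut` and `_runs`, the `min_gap` parameter, and the `(top, left, bottom, right)` box convention are all assumptions made for this example, and the connected component and Voronoi stages are not covered.

```python
from typing import List, Optional, Tuple

# Hypothetical box convention: (top, left, bottom, right), bottom/right exclusive.
Box = Tuple[int, int, int, int]


def _runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return [start, end) ranges of non-zero profile values.

    Runs separated by fewer than `min_gap` blank positions are merged.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(profile + [0]):  # sentinel 0 closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if runs and start - runs[-1][1] < min_gap:
                runs[-1] = (runs[-1][0], i)  # gap too small: merge
            else:
                runs.append((start, i))
            start = None
    return runs


def xy_cut(img: List[List[int]], box: Optional[Box] = None,
           min_gap: int = 1) -> List[Box]:
    """Recursively split a region at blank rows, then blank columns."""
    if box is None:
        box = (0, 0, len(img), len(img[0]) if img else 0)
    top, left, bottom, right = box
    if top >= bottom or left >= right:
        return []

    # Horizontal projection: amount of ink in each row of the region.
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    row_runs = _runs(rows, min_gap)
    if not row_runs:
        return []  # blank region
    if len(row_runs) > 1:
        out: List[Box] = []
        for a, b in row_runs:
            out += xy_cut(img, (top + a, left, top + b, right), min_gap)
        return out

    # Single row band: tighten vertically, then try cutting along columns.
    a, b = row_runs[0]
    top, bottom = top + a, top + b
    cols = [sum(img[y][x] for y in range(top, bottom))
            for x in range(left, right)]
    col_runs = _runs(cols, min_gap)
    if len(col_runs) > 1:
        out = []
        for c, d in col_runs:
            out += xy_cut(img, (top, left + c, bottom, left + d), min_gap)
        return out

    # No further cut possible: emit the tight bounding box.
    c, d = col_runs[0]
    return [(top, left + c, bottom, left + d)]


# Toy "page": two blocks side by side on top, one block below.
PAGE = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
print(xy_cut(PAGE))
```

Raising `min_gap` makes the sketch ignore narrow blank gaps, which is one plausible way to keep the strokes of a single character together; a suitable value would depend on the scan resolution and is not given in the abstract.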