Recognizing Vietnamese Online Handwritten Separated Characters

  • Authors: Duy Khuong Nguyen; The Duy Bui

  • Venue: ALPIT '08 Proceedings of the 2008 International Conference on Advanced Language Processing and Web Information Technology
  • Year: 2008

Abstract

The Vietnamese alphabet is based on the Latin alphabet with the addition of nine accent marks, or diacritics: four that create additional sounds and five that indicate the tone of each word. Because Vietnamese is a tonal language that uses tone to distinguish words, recognizing diacritics is an important part of recognizing Vietnamese words. In written form, however, diacritics are much smaller than the characters they modify, which makes them very hard to recognize. Previous work on Vietnamese character recognition often pre-processes the input with a graph-based approach that tries to separate the main characters from their diacritics by determining connected regions at the pixel level. This approach works well only when the input contains characters with separable diacritics, for example scanned images of printed documents. In this paper we propose a robust method for recognizing online Vietnamese characters with diacritics. Using the cosine transform with appropriate sampling algorithms, we represent the multiple strokes of a character together in a single set of features. This set of features is then used as the input to a well-designed machine-learning-based system. We have tested our system on a combination of Vietnamese characters with diacritics and Section 1c (isolated characters) of the Unipen data set, and have obtained very competitive results.
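
The abstract does not give the details of the sampling or the transform, so the following is only a minimal sketch of the general idea of turning several pen strokes into one fixed-length feature vector. Everything in it is an assumption made for illustration, not the authors' method: the function names (`resample`, `dct2`, `stroke_features`), joining the strokes in writing order into a single trajectory, resampling by arc length, bounding-box normalization, and the choice of 32 sample points and 8 type-II DCT coefficients per axis.

```python
import math


def resample(points, n):
    """Resample a polyline to n >= 2 points equally spaced along its arc length."""
    # Cumulative arc length at each original point.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0:
        # Degenerate input (a single dot): repeat the point.
        return [points[0]] * n
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def dct2(signal, k):
    """First k coefficients of the unnormalized type-II discrete cosine transform."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * u / n) for i, v in enumerate(signal))
        for u in range(k)
    ]


def stroke_features(strokes, n_points=32, n_coeffs=8):
    """Map a list of strokes (each a list of (x, y) points) to one feature vector.

    Assumption: strokes are joined in writing order, so pen-up moves become
    straight segments of the combined trajectory.
    """
    trajectory = [p for stroke in strokes for p in stroke]
    pts = resample(trajectory, n_points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # Normalize position and size using the bounding box.
    min_x, min_y = min(xs), min(ys)
    scale = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    xs = [(x - min_x) / scale for x in xs]
    ys = [(y - min_y) / scale for y in ys]
    # Low-order DCT coefficients of x(t) and y(t) form the feature vector.
    return dct2(xs, n_coeffs) + dct2(ys, n_coeffs)


# Hypothetical two-stroke character: a base stroke plus a small diacritic stroke.
strokes = [[(0, 0), (0, 10), (5, 10)], [(2, 12), (3, 13)]]
features = stroke_features(strokes)
```

Under these assumptions the feature vector has the same length regardless of how many strokes or raw points the character has, which is what allows a base letter and its diacritic to be presented to a classifier as one input rather than being segmented first.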