Lampung - a new handwritten character benchmark: database, labeling and recognition
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
One of the major issues in handwritten character recognition is the efficient creation of ground truth to train and test the different recognizers. Manual labeling of the data by a human expert is a tedious and costly procedure. In this paper we propose an efficient and low-cost semi-automatic labeling system for character datasets. First, the data is represented at different abstraction levels, which are then clustered in an unsupervised manner. The resulting clusters are labeled by human experts, and finally unanimity voting decides whether a label is accepted. The experimental results show that labeling less than 0.5% of the training data is sufficient to achieve a recognition rate of 86.21% for a brand-new script (Lampung) and 94.81% for the MNIST benchmark dataset, using only a K-nearest neighbor classifier for recognition.
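The labeling pipeline summarized above (several representations, unsupervised clustering, expert labels per cluster, unanimity voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm or say how experts inspect a cluster, so the sketch assumes plain k-means and assumes the expert labels one representative sample per cluster. The names `kmeans`, `unanimous`, `semi_automatic_labels`, and `ask_expert` are all hypothetical.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def unanimous(votes):
    """Return the common label if all votes agree, otherwise None (rejected)."""
    return votes[0] if len(set(votes)) == 1 else None


def semi_automatic_labels(views, k, ask_expert):
    """Label samples from several representations ("views") of the same data.

    views      : list of representations; views[v][i] is sample i in view v
    k          : number of clusters per view
    ask_expert : hypothetical callback, sample index -> label, standing in
                 for the human expert
    """
    n = len(views[0])
    votes_per_view = []
    for v, points in enumerate(views):
        assign = kmeans(points, k, seed=v)
        cluster_label = {}
        for c in set(assign):
            members = [i for i in range(n) if assign[i] == c]
            centroid = [sum(col) / len(members)
                        for col in zip(*(points[i] for i in members))]
            # Assumption: the expert sees only the sample closest to the centroid.
            representative = min(members, key=lambda i: _dist2(points[i], centroid))
            cluster_label[c] = ask_expert(representative)
        # Every sample inherits the label of its cluster in this view.
        votes_per_view.append([cluster_label[a] for a in assign])
    # Unanimity voting: keep a label only if all views agree on it.
    return [unanimous(list(votes)) for votes in zip(*votes_per_view)]
```

In this sketch the expert is queried at most once per cluster and view, so the labeling effort grows with the number of clusters rather than the number of samples. Samples on which the views disagree come back as `None` and would stay unlabeled.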