The impact of OCR accuracy and feature transformation on automatic text classification

  • Authors:
  • Mayo Murata;Lazaro S. P. Busagala;Wataru Ohyama;Tetsushi Wakabayashi;Fumitaka Kimura

  • Affiliations:
  • Faculty of Engineering, Mie University, Mie, Japan;Faculty of Engineering, Mie University, Mie, Japan;Faculty of Engineering, Mie University, Mie, Japan;Faculty of Engineering, Mie University, Mie, Japan;Faculty of Engineering, Mie University, Mie, Japan

  • Venue:
  • DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
  • Year:
  • 2006

Abstract

The digitization of printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated text contains errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR text by absolute word frequency has limitations: it depends on text length and on the word recognition rate, which raises within-class variance and consequently lowers classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
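The dependency of absolute word frequency on text length and recognition rate can be illustrated with a small sketch. The abstract does not name the transformations used, so the relative-frequency normalization below is only an assumed example of a feature transformation, not necessarily the authors' method. The vocabulary, the sample documents, and the function names are all hypothetical.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total in-vocabulary count (assumed transform)."""
    absolute = absolute_frequency(tokens, vocabulary)
    total = sum(absolute)
    if total == 0:
        # No vocabulary word was recognized; return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [count / total for count in absolute]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "text", "error"]
long_doc = short_doc * 3  # same content, three times as long

# Same length as short_doc * 2, but half the words are misrecognized
# and therefore fall outside the vocabulary.
degraded_doc = short_doc + ["0cr", "t3xt", "t3xt", "err0r"]

abs_short = absolute_frequency(short_doc, vocabulary)
abs_long = absolute_frequency(long_doc, vocabulary)
abs_clean = absolute_frequency(short_doc * 2, vocabulary)
abs_degraded = absolute_frequency(degraded_doc, vocabulary)

rel_short = relative_frequency(short_doc, vocabulary)
rel_long = relative_frequency(long_doc, vocabulary)
rel_degraded = relative_frequency(degraded_doc, vocabulary)
```

In this toy setting the absolute vectors differ between the short and long documents, and between the clean and degraded ones, while the relative vectors coincide. That holds here only because the simulated recognition errors affect every word in the same proportion; real OCR errors are not uniform, so a normalization of this kind would reduce the variance rather than eliminate it.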