A study on document retrieval system based on visualization to manage OCR documents

  • Authors:
  • Kazuki Tamura; Tomohiro Yoshikawa; Takeshi Furuhashi

  • Affiliations:
  • Nagoya University, Nagoya, Japan; Nagoya University, Nagoya, Japan; Nagoya University, Nagoya, Japan

  • Venue:
  • HCI'13 Proceedings of the 15th international conference on Human-Computer Interaction: interaction modalities and techniques - Volume Part IV
  • Year:
  • 2013

Visualization

Abstract

In recent years, the digitization of paper-based documents has advanced rapidly with the spread of scanners. However, tagging or sorting a huge number of scanned documents one by one is impractical in terms of time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies and retrieves the documents, would therefore be useful. LDA (Latent Dirichlet Allocation), one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy worsens. This paper considers the case of applying LDA to Japanese OCR documents and studies a method to improve the performance of topic inference. It defines the reliability of recognized words using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability measure is confirmed through a preliminary experiment on detecting falsely recognized words, and an experiment classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
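
The abstract does not give the paper's actual reliability definition or weighting scheme, so the following is only a minimal sketch of the general idea, under stated assumptions: reliability is taken to be the geometric mean of add-one-smoothed character bigram probabilities estimated from a clean reference vocabulary, and the "weighting" simply replaces each token's unit count with its reliability before the counts are passed to a topic model. The function names, the boundary markers, the `alphabet_size` parameter, and the Latin-alphabet toy data are all hypothetical; they are not taken from the paper, which targets Japanese text.

```python
import math
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical word-boundary markers


def train_bigram_counts(reference_words):
    """Count character contexts and bigrams in a clean reference vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for word in reference_words:
        padded = START + word + END
        contexts.update(padded[:-1])             # every char that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent character pairs
    return contexts, bigrams


def reliability(word, contexts, bigrams, alphabet_size):
    """Geometric mean of add-one-smoothed bigram probabilities, in (0, 1].

    This scoring rule is an assumption for illustration, not the paper's.
    """
    padded = START + word + END
    pairs = list(zip(padded, padded[1:]))
    log_p = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (contexts[a] + alphabet_size)
        log_p += math.log(p)
    return math.exp(log_p / len(pairs))


def weighted_counts(tokens, contexts, bigrams, alphabet_size):
    """Replace each token's unit count with its reliability score."""
    weights = defaultdict(float)
    for tok in tokens:
        weights[tok] += reliability(tok, contexts, bigrams, alphabet_size)
    return dict(weights)


# Toy data: a clean vocabulary and an OCR output containing one garbled word.
contexts, bigrams = train_bigram_counts(["data", "date", "rate"])
ALPHABET_SIZE = 28  # assumed: 26 letters plus the two boundary markers
doc_weights = weighted_counts(["data", "dxtx", "data"], contexts, bigrams, ALPHABET_SIZE)
```

In this sketch a garbled token such as `dxtx` receives a lower score than `data`, so it contributes less to the weighted counts. How such weights enter LDA inference in the paper is not stated in the abstract; feeding fractional counts to a topic model is just one possible design.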