A study on document retrieval system based on visualization to manage OCR documents

Authors:
Kazuki Tamura;Tomohiro Yoshikawa;Takeshi Furuhashi
Affiliations:
Nagoya University, Nagoya, Japan;Nagoya University, Nagoya, Japan;Nagoya University, Nagoya, Japan
Venue:
HCI'13 Proceedings of the 15th international conference on Human-Computer Interaction: interaction modalities and techniques - Volume Part IV
Year:
2013

Citing 12
Cited 0

Probabilistic latent semantic indexing

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Recognition of Cursive Roman Handwriting - Past, Present and Future

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Latent dirichlet allocation

The Journal of Machine Learning Research
Japanese OCR error correction using character shape similarity and statistical language model

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Probabilistic topic decomposition of an eighteenth-century American newspaper

Journal of the American Society for Information Science and Technology
Dynamic topic models

ICML '06 Proceedings of the 23rd international conference on Machine learning
LDA-based document models for ad-hoc retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic latent semantic visualization: topic model for visualizing documents

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient methods for topic model inference on streaming document collections

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Term weighting schemes for Latent Dirichlet Allocation

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Evaluating models of latent document semantics in the presence of OCR errors

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Divergence measures based on the Shannon entropy

IEEE Transactions on Information Theory

Quantified Score

Hi-index	0.00

Visualization

Abstract

Recently, the digitization of paper-based documents is rapidly advanced through the spread of scanners. However, tagging or sorting a huge amount of scanned documents one by one is difficult in terms of time and effort. Therefore, the system which extracts features from texts in the documents automatically, which is available by OCR, and classifies/retrieves documents will be useful. LDA, one of the most popular Topic Models, is known as a method to extract the features of each document and the relationships between documents. However, it is reported that the performance of LDA declines along with poor OCR recognition. This paper assumes the case of applying LDA to Japanese OCR documents and studies the method to improve the performance of topic inference. This paper defines the reliability of the recognized words using N-gram and proposes the weighting LDA method based on the reliability. Adequacy of the reliability of the recognized words is confirmed through the preliminary experiment detecting false recognized words, and then the experiment to classify practical OCR documents are carried out. The experimental results show the improvement of the classification performance by the proposed method comparing with the conventional methods.