The spread of scanners has rapidly advanced the digitization of paper documents. However, tagging or sorting a large number of scanned documents one by one costs considerable time and effort. A system that automatically extracts features from the document text obtained by OCR, and then classifies or retrieves the documents, would therefore be useful. LDA, one of the most popular topic models, is a known method for extracting the features of each document and the relationships between documents. However, LDA performance has been reported to decline as OCR recognition accuracy falls. This paper considers the application of LDA to Japanese OCR documents and studies a method for improving topic inference. It defines a reliability score for recognized words using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability score is confirmed in a preliminary experiment on detecting misrecognized words, and an experiment on classifying real OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
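
The abstract does not give the reliability formula or the weighting rule, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a character-bigram model with add-one smoothing trained on trusted text, takes the geometric mean of bigram probabilities as the reliability of a recognized word, and uses that score as a fractional term count in place of 1. The function names `train_char_bigrams`, `reliability`, and `weighted_counts` are hypothetical.

```python
import math
from collections import Counter


def train_char_bigrams(clean_words):
    """Count character bigrams (with boundary markers) in trusted text.

    Assumption: a character-level bigram model; the paper only says "N-gram".
    """
    bigrams, unigrams = Counter(), Counter()
    for word in clean_words:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab = {c for w in clean_words for c in w} | {"^", "$"}
    return bigrams, unigrams, len(vocab)


def reliability(word, model):
    """Geometric mean of add-one-smoothed bigram probabilities (between 0 and 1).

    Assumption: this score stands in for the paper's unspecified reliability.
    """
    bigrams, unigrams, vocab_size = model
    chars = ["^"] + list(word) + ["$"]
    log_p = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
        log_p += math.log(p)
    return math.exp(log_p / (len(chars) - 1))


def weighted_counts(ocr_words, model):
    """Accumulate each recognized word's reliability instead of a count of 1."""
    counts = Counter()
    for w in ocr_words:
        counts[w] += reliability(w, model)
    return counts


# Toy data for illustration only; not taken from the paper.
model = train_char_bigrams(["topic", "topics", "model", "models"])
doc_counts = weighted_counts(["topic", "topic", "tqpxc"], model)
```

Under these assumptions, a plausible word such as `topic` contributes more to the document's term counts than a likely misrecognition such as `tqpxc`, so a topic model fed these fractional counts would lean less on unreliable tokens. How the paper actually incorporates the weights into LDA inference is not stated in the abstract.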