We propose a new approach to stochastic error-correcting language modeling, based on weighted finite-state transducers (WFSTs), for post-processing the output of an optical character recognizer (OCR). Instead of feeding only the recognized string to the transducer, our approach uses the complete set of OCR hypotheses, given as a sequence of vectors of a posteriori class probabilities, to build a WFST that is then composed with independent WFSTs for the error model and the language model. This combines the practical advantages of a decoupled (OCR + post-processor) model with the full power of an integrated model.
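To make the composition idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a substitution-only error model and a unigram word list as the language model, and it replaces explicit WFST composition with an equivalent best-path search in the tropical (min, +) semiring over negative log-probabilities. All names and numbers (`POSTERIORS`, `LEXICON`, `error_prob`, `best_correction`) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical OCR output: one posterior distribution over characters per position.
POSTERIORS = [
    {"c": 0.55, "e": 0.45},
    {"a": 0.60, "o": 0.40},
    {"l": 0.70, "t": 0.30},
]

# Hypothetical language model: a unigram prior over a tiny word list.
LEXICON = {"cat": 0.6, "eat": 0.3, "cot": 0.1}


def error_prob(observed, intended):
    """Hypothetical substitution-only error model P(observed | intended)."""
    return 0.9 if observed == intended else 0.05


def best_correction(posteriors, lexicon, err):
    """Return (word, cost) for the cheapest path through the combination of
    the hypothesis, error, and language models.

    Costs are negative log-probabilities, so path weights add and
    alternative paths are resolved by taking the minimum.
    """
    best_word, best_cost = None, math.inf
    for word, prior in lexicon.items():
        # Assumption: substitutions only, so lengths must match.
        if len(word) != len(posteriors):
            continue
        cost = -math.log(prior)  # language model weight
        for dist, intended in zip(posteriors, word):
            # Cheapest OCR hypothesis that the error model maps to `intended`.
            cost += min(
                -math.log(p) - math.log(err(observed, intended))
                for observed, p in dist.items()
                if p > 0
            )
        if cost < best_cost:
            best_word, best_cost = word, cost
    return best_word, best_cost


if __name__ == "__main__":
    top1 = "".join(max(d, key=d.get) for d in POSTERIORS)
    word, cost = best_correction(POSTERIORS, LEXICON, error_prob)
    print(top1, "->", word)
```

In this toy setting the top-1 OCR string is not in the word list, but the second-ranked hypothesis at the last position still carries enough probability mass for the language model to recover a valid word. A post-processor that sees only the recognized string would have discarded that information. A real system would build the three models as actual WFSTs and compose them with a transducer toolkit, which also allows insertions and deletions in the error model.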