Normalizing historical orthography for OCR historical documents using LSTM

  • Authors:
  • Mayce Al Azawi; Muhammad Zeshan Afzal; Thomas M. Breuel

  • Affiliations:
  • University of Kaiserslautern, Kaiserslautern, Germany (all authors)

  • Venue:
  • Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
  • Year:
  • 2013

Abstract

Historical text presents numerous challenges for contemporary techniques such as information retrieval, OCR, and POS tagging. In particular, the absence of consistent orthographic conventions in historical text creates difficulties for any system that requires reference to a fixed lexicon accessed by orthographic form, for example a language model or retrieval engine operating on OCR output, where the spelling of words often varies: a single word may have several spellings that evolved over time. It is therefore important to support such techniques with rules for automatically mapping historical wordforms to modern ones. In this paper, we propose a new technique that models the target modern language by means of a recurrent neural network with a long short-term memory (LSTM) architecture. Because the network is recurrent, the considered context is not limited to a fixed size, and its memory cells are designed to deal with long-term dependencies. In a set of experiments conducted on the Luther Bible database, we transform wordforms from Early New High German (ENHG, 14th-16th centuries) into the corresponding modern wordforms in New High German (NHG). We compare our proposed supervised LSTM model to several methods for computing word alignments based on statistical and heuristic models, and our LSTM outperforms the three state-of-the-art methods. The evaluation shows that our model reaches 93.90% accuracy on known wordforms and 87.95% on unknown wordforms, while the existing state-of-the-art combined approach of wordlist-based and rule-based normalization reaches 92.93% on known and 76.88% on unknown tokens. Our proposed LSTM model also outperforms the baselines on the reverse task of normalizing modern wordforms to historical wordforms, with 93.4% accuracy on seen tokens and 89.17% on unknown tokens.
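The memory cells mentioned in the abstract are the core of the LSTM architecture. The following is a minimal, untrained sketch of that gating mechanism applied to a character sequence; the function names, the toy vocabulary, and the illustrative ENHG spelling `vnnd` (for modern `und`) are our own assumptions, not the authors' implementation. A real normalizer would learn the weights from aligned ENHG/NHG wordform pairs and add a decoder that emits the modern spelling.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step over the concatenated input [x; h_prev]."""
    H = len(h_prev)
    xh = x + h_prev  # list concatenation: input features + previous hidden state
    z = [sum(W[r][k] * xh[k] for k in range(len(xh))) + b[r]
         for r in range(4 * H)]
    i = [sigmoid(v) for v in z[0:H]]          # input gate
    f = [sigmoid(v) for v in z[H:2 * H]]      # forget gate
    o = [sigmoid(v) for v in z[2 * H:3 * H]]  # output gate
    g = [math.tanh(v) for v in z[3 * H:4 * H]]  # candidate cell values
    # Memory cell: forget part of the old state, write part of the new one.
    # This additive update is what lets LSTMs carry long-term context.
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(H)]
    h = [o[j] * math.tanh(c[j]) for j in range(H)]
    return h, c

def encode(word, vocab, hidden=4, seed=0):
    """Run one character at a time through an LSTM layer with random
    (untrained) weights and return the final hidden state."""
    rng = random.Random(seed)
    V = len(vocab)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(V + hidden)]
         for _ in range(4 * hidden)]
    b = [0.0] * (4 * hidden)
    h = [0.0] * hidden
    c = [0.0] * hidden
    for ch in word:
        x = [1.0 if j == vocab.index(ch) else 0.0 for j in range(V)]  # one-hot
        h, c = lstm_step(x, h, c, W, b)
    return h

vocab = sorted(set("vnndu"))
historical = encode("vnnd", vocab)  # hypothetical ENHG spelling of "und"
modern = encode("und", vocab)
```

Because the recurrence feeds each hidden state back into the next step, the final state summarizes the whole character sequence rather than a fixed-size window, which is the property the abstract highlights.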