Historical text presents numerous challenges for contemporary language technologies such as information retrieval, OCR, and POS tagging. In particular, the absence of consistent orthographic conventions in historical text causes difficulties for any system that requires reference to a fixed lexicon accessed by orthographic form, for example a language model or a retrieval engine operating on OCR output of historical documents, where the spelling of a word may vary in several ways; a single word may have multiple spellings that evolved over time. It is therefore important to support such systems with rules for automatically mapping historical wordforms to their modern counterparts. In this paper, we propose a new technique that models the target modern language by means of a recurrent neural network with the long short-term memory (LSTM) architecture. Because the network is recurrent, the context it considers is not limited to a fixed size, in particular thanks to the memory cells, which are designed to capture long-term dependencies. In a set of experiments conducted on the Luther Bible database, we transform wordforms from Early New High German (ENHG, 14th–16th centuries) into the corresponding modern wordforms in New High German (NHG). We compare our supervised LSTM model to several statistical and heuristic methods for computing word alignments. Our LSTM outperforms the three state-of-the-art baselines: it achieves an accuracy of 93.90% on known wordforms and 87.95% on unknown wordforms, whereas the existing state-of-the-art combination of wordlist-based and rule-based normalization models achieves 92.93% on known and 76.88% on unknown tokens. Our LSTM model also outperforms the baselines in the reverse direction, normalizing modern wordforms to historical wordforms, with 93.4% accuracy on seen tokens and 89.17% on unknown tokens.
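The memory-cell mechanism the abstract appeals to, an additive cell-state update gated by input, forget, and output gates, can be illustrated with a single LSTM time step. The sketch below is a minimal NumPy illustration, not the authors' implementation; the hidden size, random weights, and the toy ENHG input "vnnd" are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Gates are stacked in z as [input; forget; output; candidate].
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])            # input gate: how much new info enters the cell
    f = sigmoid(z[H:2 * H])       # forget gate: how much old cell state survives
    o = sigmoid(z[2 * H:3 * H])   # output gate: how much of the cell is exposed
    g = np.tanh(z[3 * H:])        # candidate cell content
    c = f * c_prev + i * g        # additive update: preserves long-range context
    h = o * np.tanh(c)
    return h, c

# Toy run: feed a (hypothetical) historical wordform character by character.
rng = np.random.default_rng(0)
word = "vnnd"
alphabet = sorted(set(word))            # one-hot alphabet for this toy example
D, H = len(alphabet), 8                 # input dim and hidden size (toy values)
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for ch in word:
    x = np.zeros(D)
    x[alphabet.index(ch)] = 1.0         # one-hot character encoding
    h, c = lstm_step(x, h, c, W, U, b)
```

In a full character-level normalizer, the hidden state `h` after the last character would feed a decoder (or softmax over output characters) that emits the modern NHG spelling; the additive cell update is what lets the context window grow with the word rather than being fixed in advance.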