Decoding Substitution Ciphers by Means of Word Matching with Application to OCR

Authors:
G. Nagy; S. Seth; K. Einspahr
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
1987

Citing 0
Cited 6

Twenty Years of Document Image Analysis in PAMI
IEEE Transactions on Pattern Analysis and Machine Intelligence

Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing
IEEE Transactions on Pattern Analysis and Machine Intelligence

In search of meaning for time series subsequence clustering: matching algorithms based on a new distance measure
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

An exact A* method for deciphering letter-substitution ciphers
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Interactive, mobile, distributed pattern recognition
ICIAP'05 Proceedings of the 13th international conference on Image Analysis and Processing

Estimation, learning, and adaptation: systems that improve with use
SSPR'12/SPR'12 Proceedings of the 2012 Joint IAPR international conference on Structural, Syntactic, and Statistical Pattern Recognition

Abstract

A substitution cipher consists of a block of natural language text where each letter of the alphabet has been replaced by a distinct symbol. As a problem in cryptography, the substitution cipher is of limited interest, but it has an important application in optical character recognition. Recent advances render it quite feasible to scan documents with a fairly complex layout and to classify (cluster) the printed characters into distinct groups according to their shape. However, given the immense variety of type styles and forms in current use, it is not possible to assign alphabetical identities to characters of arbitrary size and typeface. This gap can be bridged by solving the equivalent of a substitution cipher problem, thereby opening up the possibility of automatic translation of a scanned document into a standard character code, such as ASCII. Earlier methods relying on letter n-gram frequencies require a substantial amount of ciphertext for accurate n-gram estimates. A dictionary-based approach solves the problem using relatively small ciphertext samples and a dictionary of fewer than 500 words. Our heuristic backtrack algorithm typically visits only a few hundred among the 26! possible nodes on sample texts ranging from 100 to 600 words.
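
The dictionary-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic backtracking search over word-pattern matches, and it assumes that every ciphertext word appears in the dictionary and that word boundaries are known. The names `pattern`, `solve`, and `decode` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def pattern(word):
    """Repetition pattern of a word, e.g. 'level' -> (0, 1, 2, 1, 0)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)


def solve(cipher_words, dictionary):
    """Return a cipher-symbol -> letter key, or None if no key is consistent.

    Assumption (not from the paper): every cipher word is in the dictionary.
    """
    by_pattern = defaultdict(list)
    for w in dictionary:
        by_pattern[pattern(w)].append(w)

    # Heuristic: try the words with the fewest candidate matches first.
    unique = list(dict.fromkeys(cipher_words))
    order = sorted(unique, key=lambda c: len(by_pattern.get(pattern(c), ())))

    def extend(i, key, used):
        if i == len(order):
            return key
        cw = order[i]
        for cand in by_pattern.get(pattern(cw), ()):
            new_key, new_used, ok = dict(key), set(used), True
            for c, p in zip(cw, cand):
                if c in new_key:
                    if new_key[c] != p:      # conflicts with an earlier choice
                        ok = False
                        break
                elif p in new_used:          # letter already assigned elsewhere
                    ok = False
                    break
                else:
                    new_key[c] = p
                    new_used.add(p)
            if ok:
                result = extend(i + 1, new_key, new_used)
                if result is not None:
                    return result
        return None                          # backtrack

    return extend(0, {}, set())


def decode(cipher_words, key):
    return " ".join("".join(key[c] for c in w) for w in cipher_words)


words = ["xqzx", "mzx"]                      # toy ciphertext for "that cat"
lexicon = ["the", "cat", "hat", "that", "sat"]
key = solve(words, lexicon)
print(decode(words, key))                    # prints "that cat"
```

Ordering the words by candidate count is one simple way to keep the search tree small; the heuristics in the paper itself may differ.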