It has been shown that simple substitution ciphers can be solved using statistical methods such as probabilistic relaxation. However, the utility of such solutions has been limited by their inability to cope with noise encountered in practical applications. In this paper, we propose a new solution to substitution deciphering based on hidden Markov models. We show that our algorithm is more accurate than relaxation and much more robust in the presence of noise, making it useful for applications in compressed document processing. Recovering character interpretations from the sequence of cluster identifiers in a symbolically compressed document can be treated as a cipher problem. Although a significant amount of noise is present in the cluster sequence, enough information can be recovered with a robust deciphering algorithm to accomplish certain document analysis tasks. The feasibility of this approach is demonstrated in a multilingual document duplicate detection system.
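
To make the HMM formulation above concrete, here is a minimal sketch, not the paper's actual algorithm. It assumes the hidden states are plaintext characters, the transition probabilities come from a fixed character bigram model of the language, and only the emission probabilities (character to cipher symbol, or cluster identifier) are re-estimated by EM. The function name `train_emissions`, the three-character toy language model, and the toy key are all hypothetical choices made for illustration.

```python
import math
import random


def train_emissions(cipher, init, trans, n_symbols, iters=20, seed=0):
    """Re-estimate emission probabilities only, by EM (Baum-Welch).

    cipher    : list of int symbol ids (e.g. cluster identifiers)
    init[i]   : P(first character = i), from a language model, kept fixed
    trans[i][j]: P(next character = j | character = i), kept fixed
    Returns (emit, history): emit[i][k] = P(symbol k | character i), and
    the log-likelihood measured at the start of each iteration.
    """
    n, T = len(init), len(cipher)
    rng = random.Random(seed)
    # Near-uniform random start so the states are not perfectly symmetric.
    emit = []
    for _ in range(n):
        row = [1.0 + 0.1 * rng.random() for _ in range(n_symbols)]
        total = sum(row)
        emit.append([x / total for x in row])

    history = []
    for _ in range(iters):
        # Forward pass, rescaled at every step to avoid underflow.
        alpha = [[0.0] * n for _ in range(T)]
        scale = [0.0] * T
        for t in range(T):
            for j in range(n):
                if t == 0:
                    prior = init[j]
                else:
                    prior = sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
                alpha[t][j] = prior * emit[j][cipher[t]]
            scale[t] = sum(alpha[t])
            alpha[t] = [a / scale[t] for a in alpha[t]]
        history.append(sum(math.log(c) for c in scale))

        # Backward pass, using the same scale factors.
        beta = [[1.0] * n for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(n):
                beta[t][i] = sum(
                    trans[i][j] * emit[j][cipher[t + 1]] * beta[t + 1][j]
                    for j in range(n)
                ) / scale[t + 1]

        # M-step: expected (character, symbol) counts, normalized per row.
        counts = [[0.0] * n_symbols for _ in range(n)]
        for t in range(T):
            for i in range(n):
                counts[i][cipher[t]] += alpha[t][i] * beta[t][i]
        new_emit = []
        for i in range(n):
            total = sum(counts[i])
            # Keep the old row if a character received no probability mass.
            new_emit.append([c / total for c in counts[i]] if total > 0 else emit[i])
        emit = new_emit
    return emit, history


# Toy setup (all values are made up): a 3-character "language" and a key.
init = [0.5, 0.3, 0.2]
trans = [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.3, 0.1]]
key = [2, 0, 1]  # character i is written as cipher symbol key[i]

rng = random.Random(1)
plain = [rng.choices(range(3), weights=init)[0]]
for _ in range(199):
    plain.append(rng.choices(range(3), weights=trans[plain[-1]])[0])
cipher = [key[c] for c in plain]

emit, history = train_emissions(cipher, init, trans, n_symbols=3, iters=15)
# One way to read off a candidate key: the most likely character per symbol.
guess = [max(range(3), key=lambda i: emit[i][k]) for k in range(3)]
```

Because the emission table is probabilistic rather than a hard one-to-one key, a symbol that is occasionally mislabeled (as happens with noisy cluster identifiers) only spreads some probability mass instead of breaking the mapping, which is one intuition for why an HMM formulation can tolerate noise. Whether EM recovers the right key on a given input depends on the text length, the language model, and the starting point.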