A new family of symbolic compression algorithms has recently been developed; it includes the ongoing JBIG2 standardization effort as well as related commercial products. These techniques are designed specifically for binary document images. They cluster the individual blobs in a document and store representative blob templates together with the sequence in which the blobs occur, hence the name symbolic compression.

This paper describes a method for duplicate detection on symbolically compressed document images. It recognizes the text in an image by deciphering the sequence of blob occurrences in the compressed representation, treating that sequence as a substitution cipher over the underlying characters. We propose a Hidden Markov Model (HMM) method for solving such deciphering problems and suggest applications in multilingual document duplicate detection.
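To make the deciphering idea concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that hidden states are characters, that transition probabilities come from a character bigram model that stays fixed, and that only the emission table (character to blob-cluster ID) is learned by EM before Viterbi decoding. The function names (`bigram_model`, `forward_backward`, `decipher`, `viterbi`), the toy text, the random initialization, and the smoothing constants are all hypothetical choices made for illustration. The blob IDs are simulated by a random one-to-one relabeling of characters, and the bigram model is trained on the same toy text, which a real system would not do.

```python
import math
import random
from collections import Counter

def bigram_model(text, alphabet):
    """Initial and transition probabilities from plaintext, add-one smoothed."""
    n = len(alphabet)
    uni, pairs = Counter(text), Counter(zip(text, text[1:]))
    pi = [(uni[c] + 1) / (len(text) + n) for c in alphabet]
    A = []
    for a in alphabet:
        row = [pairs[(a, b)] + 1 for b in alphabet]
        total = sum(row)
        A.append([v / total for v in row])
    return pi, A

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward; returns state posteriors and log-likelihood."""
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [pi[i] * B[i][o] for i in range(n)]
        else:
            prev = alpha[-1]
            row = [sum(prev[j] * A[j][i] for j in range(n)) * B[i][o]
                   for i in range(n)]
        c = sum(row)
        alpha.append([v / c for v in row])
        scale.append(c)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        nxt, o = beta[t + 1], obs[t + 1]
        beta[t] = [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n)) / scale[t + 1]
                   for i in range(n)]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    return gamma, sum(math.log(c) for c in scale)

def decipher(obs, num_symbols, pi, A, iters=20, seed=0):
    """EM over the emission table only; the language model (pi, A) is fixed."""
    rng = random.Random(seed)
    n = len(pi)
    B = []
    for _ in range(n):  # random start: a uniform table would leave EM stuck
        row = [rng.random() + 0.5 for _ in range(num_symbols)]
        total = sum(row)
        B.append([v / total for v in row])
    for _ in range(iters):
        gamma, _ = forward_backward(obs, pi, A, B)
        for i in range(n):
            counts = [1e-6] * num_symbols  # tiny smoothing keeps entries positive
            for t, o in enumerate(obs):
                counts[o] += gamma[t][i]
            total = sum(counts)
            B[i] = [v / total for v in counts]
    return B

def viterbi(obs, pi, A, B):
    """Most likely character sequence for the observed blob IDs (log space)."""
    n = len(pi)
    score = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(n)]
    back = []
    for o in obs[1:]:
        ptr, new = [], []
        for i in range(n):
            j = max(range(n), key=lambda k: score[k] + math.log(A[k][i]))
            ptr.append(j)
            new.append(score[j] + math.log(A[j][i]) + math.log(B[i][o]))
        back.append(ptr)
        score = new
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy data (hypothetical): characters are relabeled by a hidden random key,
# standing in for the blob-cluster IDs of a symbolically compressed page.
plain = "the cat sat on the mat and the rat sat on the cat " * 3
alphabet = sorted(set(plain))
pi, A = bigram_model(plain, alphabet)
ids = list(range(len(alphabet)))
random.Random(1).shuffle(ids)
key = dict(zip(alphabet, ids))
obs = [key[c] for c in plain]

B = decipher(obs, len(alphabet), pi, A)
decoded = "".join(alphabet[s] for s in viterbi(obs, pi, A, B))
```

EM started from a random table can settle in a local optimum, so `decoded` is not guaranteed to match `plain`, particularly on a text this short. For duplicate detection, the decoded strings of two documents would then be compared with whatever text similarity measure the application prefers; that step is left out of the sketch.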