Analysis of whole-book recognition

  • Authors:
  • Pingping Xiu; Henry S. Baird

  • Affiliations:
  • Lehigh Univ., Bethlehem, PA; Lehigh Univ., Bethlehem, PA

  • Venue:
  • DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2010

Abstract

Whole-book recognition is a document image analysis strategy that operates on the complete set of a book's page images, attempting to improve accuracy by automatic unsupervised adaptation. Our algorithm expects to be given initial iconic and linguistic models---derived from (generally errorful) OCR results and (generally incomplete) dictionaries---and then, guided entirely by evidence internal to the test set, the algorithm corrects the models, yielding improved accuracy. We have found that successful corrections are often closely associated with "disagreements" between the models, which can be detected within the test set by measuring cross entropy between (a) the posterior probability distribution of character classes (the recognition results from image classification alone), and (b) the posterior probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). We report experiments on long passages (up to 180 pages) revealing that: (1) disagreements and error rates are strongly correlated; (2) our algorithm can drive down recognition error rates by nearly an order of magnitude; and (3) the longer the passage, the lower the error rate achievable. We also propose formal models for a book's text, for iconic and linguistic constraints, and for our whole-book recognition algorithm---and, using these, we rigorously prove sufficient conditions for the whole-book recognition strategy to succeed in the ways illustrated in the experiments.
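The abstract's disagreement measure, cross entropy between the iconic and linguistic posteriors, can be illustrated with a minimal sketch. The distributions and class labels below are hypothetical stand-ins (the paper's actual models are not reproduced here); the sketch only shows how a per-position disagreement score would be computed from two posterior distributions over the same classes.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_c p(c) * log q(c), over the classes present in p.

    p and q are dicts mapping class labels to probabilities; eps guards
    against log(0) when q assigns (near-)zero mass to a class in p.
    """
    return -sum(p[c] * math.log(max(q.get(c, 0.0), eps)) for c in p)

# Hypothetical posteriors for a single character position:
# (a) from image classification alone (the iconic model)
p_iconic = {"o": 0.6, "0": 0.3, "c": 0.1}
# (b) after combining image evidence with linguistic constraints
p_linguistic = {"o": 0.9, "0": 0.05, "c": 0.05}

# A large value signals a "disagreement" between the two models,
# flagging a position where a model correction may be warranted.
disagreement = cross_entropy(p_iconic, p_linguistic)
```

By Gibbs' inequality, the cross entropy is minimized when the two distributions agree, so positions where the linguistic constraints sharply reshape the iconic posterior stand out with high scores.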