Towards versatile document analysis systems

Authors:
Henry S. Baird;Matthew R. Casey
Affiliations:
Computer Science & Engineering Dept, Lehigh University, Bethlehem, PA;Computer Science & Engineering Dept, Lehigh University, Bethlehem, PA
Venue:
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Year:
2006

Abstract

The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. To solve this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized (or unparameterizable) distributions. Perhaps fast expected-time approximate k-nearest-neighbor classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.

Keywords: versatile document analysis systems, DAS methodology, document image content extraction, classification, k-nearest neighbors, k-d trees, CART, spatial data structures, computational geometry, hashing
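
The abstract names hashed k-d trees as a promising basis for approximate k-nearest-neighbor classification but does not describe the data structure. The following is a minimal illustrative sketch, not the authors' implementation. It assumes that cuts cycle through the feature dimensions and always fall at the midpoint of the current interval, so that a cell's address can be computed from the point alone and used as a hash key. The names `cell_key` and `HashedKDClassifier`, the parameters `depth` and `k`, and the class labels are all hypothetical.

```python
from collections import Counter, defaultdict
import math


def cell_key(point, lows, highs, depth):
    """Return the integer address of the k-d cell containing `point`.

    Assumption: cuts cycle through the dimensions and fall at the midpoint
    of the current interval, so no tree has to be stored or traversed.
    """
    lo, hi = list(lows), list(highs)
    key = 0
    for level in range(depth):
        d = level % len(point)          # dimension cut at this level
        mid = (lo[d] + hi[d]) / 2.0
        bit = 1 if point[d] >= mid else 0
        key = (key << 1) | bit          # append one address bit per cut
        if bit:
            lo[d] = mid
        else:
            hi[d] = mid
    return key


class HashedKDClassifier:
    """Approximate k-NN: search only the hash bin the query falls into."""

    def __init__(self, lows, highs, depth=8, k=3):
        self.lows, self.highs = lows, highs
        self.depth, self.k = depth, k
        self.bins = defaultdict(list)   # cell address -> [(sample, label)]

    def fit(self, samples, labels):
        for x, y in zip(samples, labels):
            key = cell_key(x, self.lows, self.highs, self.depth)
            self.bins[key].append((x, y))
        return self

    def predict(self, x):
        key = cell_key(x, self.lows, self.highs, self.depth)
        bucket = self.bins.get(key)
        if not bucket:
            return None                 # empty cell: no decision
        nearest = sorted(bucket, key=lambda s: math.dist(s[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

Because only one bin is searched, a true nearest neighbor lying just across a cell boundary is missed; that is the sense in which the search is approximate. The bin table also grows with the training set, which matches the abstract's remark about enormous data structures. A query that lands in an empty cell returns `None` in this sketch; a practical system would need a fallback policy.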