The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized (or unparameterizable) distributions. Perhaps fast expected-time approximate k-nearest neighbors classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.

Keywords: versatile document analysis systems, DAS methodology, document image content extraction, classification, k Nearest Neighbors, k-d trees, CART, spatial data structures, computational geometry, hashing
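To make the idea of a hashed k-d tree more concrete, here is a minimal Python sketch of one way such an approximate k-nearest neighbors classifier could be organized. It is an illustration under stated assumptions, not the authors' implementation: features are assumed to be scaled into [0, 1), cuts are placed at fixed midpoints while cycling through the dimensions, and the function names (`cell_key`, `build_index`, `classify`) and the class labels used with it are hypothetical. Each leaf cell of the implicit k-d tree is addressed by the bit string of its cut decisions, and that address is used as a hash key, so no explicit tree is stored. The search is approximate because only the query's own cell is examined.

```python
from collections import Counter, defaultdict
import math


def cell_key(point, depth):
    """Return the address of the k-d cell holding `point` after `depth` cuts.

    Assumption: every coordinate lies in [0, 1). Cuts are made at the
    midpoint of the current cell, cycling through the dimensions.
    """
    lo = [0.0] * len(point)
    hi = [1.0] * len(point)
    key = 0
    for level in range(depth):
        axis = level % len(point)
        mid = (lo[axis] + hi[axis]) / 2.0
        bit = 1 if point[axis] >= mid else 0
        if bit:
            lo[axis] = mid   # descend into the upper half along this axis
        else:
            hi[axis] = mid   # descend into the lower half along this axis
        key = (key << 1) | bit
    return key


def build_index(samples, depth):
    """Hash each (point, label) training sample into its leaf cell."""
    index = defaultdict(list)
    for point, label in samples:
        index[cell_key(point, depth)].append((point, label))
    return index


def classify(index, query, depth, k=3):
    """Approximate k-NN vote using only the samples in the query's cell.

    Returns None when the cell is empty; true nearest neighbors lying in
    adjacent cells are deliberately ignored, which is the approximation.
    """
    bucket = index.get(cell_key(query, depth), [])
    if not bucket:
        return None
    nearest = sorted(bucket, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The depth parameter trades accuracy for speed and memory in this sketch: deeper cuts give smaller buckets and faster queries, but leave more cells empty and make it more likely that the true nearest neighbors fall in a neighboring cell.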