Word Spotting: A New Approach to Indexing Handwriting
CVPR '96 Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96)
Features for Word Spotting in Historical Manuscripts
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
A search engine for historical manuscript images
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Approximate similarity search in metric spaces using inverted files
Proceedings of the 3rd international conference on Scalable information systems
An approach to content-based image retrieval based on the Lucene search engine library
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
A framework to access handwritten information within large digitized paper collections
E-SCIENCE '12 Proceedings of the 2012 IEEE 8th International Conference on E-Science (e-Science)
Digitization and search: A non-traditional use of HPC
E-SCIENCE '12 Proceedings of the 2012 IEEE 8th International Conference on E-Science (e-Science)
Hi-index | 0.00 |
An improved approach towards enabling search capabilities over large digitized document archives is described, in which Lucene indices were incorporated in a framework developed to provide automatic searchable access to the 1940 US Census, a collection composed of digitized handwritten forms. As an alternative to trying to recognize the handwritten text in the images, Word Spotting feature vectors are used to describe each cell's content. Instead of querying the system using regular ASCII text, any query is rendered as an image and a ranked list of matching results is presented to the user. Among other pre-processing steps required by the framework, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors instead of other indexing methods are discussed in light of the challenges confronted when dealing with digitized document collections of considerable size.