Using Lucene to index and search the digitized 1940 US census

Authors:
Liana Diesendruck;Rob Kooper;Luigi Marini;Kenton McHenry
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
Year:
2013

Abstract

This paper describes an improved approach to enabling search over large digitized document archives, in which Lucene indices are incorporated into a framework developed to provide automatic searchable access to the 1940 US Census, a collection of digitized handwritten forms. Rather than attempting to recognize the handwritten text in the images, the framework uses Word Spotting feature vectors to describe the content of each cell. A query is not matched as regular ASCII text; instead, it is rendered as an image, and a ranked list of matching results is presented to the user. Among the pre-processing steps the framework requires, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors, compared with other indexing methods, are discussed in light of the challenges posed by digitized document collections of considerable size.
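
The abstract does not say how the feature vectors are encoded for Lucene, so the following is only a minimal sketch of the general idea of serving feature vectors from an inverted index. It does not use the Lucene API and is not the authors' method: the quantization scheme, the `step` parameter, the class name `InvertedVectorIndex`, and the sample cell identifiers and vectors are all hypothetical, chosen for illustration. Each vector is turned into text-like tokens, candidates are fetched through a token-to-cell postings map, and the candidates are then ranked by distance to the query vector.

```python
import math
from collections import defaultdict


def quantize(vector, step=0.25):
    """Turn a numeric feature vector into text-like tokens such as 'd0_b2'.

    Hypothetical encoding: one token per dimension, naming the bin the value
    falls into. The source does not describe the actual encoding.
    """
    return [f"d{i}_b{math.floor(v / step)}" for i, v in enumerate(vector)]


class InvertedVectorIndex:
    """Toy inverted index over quantized feature vectors (illustrative only)."""

    def __init__(self, step=0.25):
        self.step = step
        self.postings = defaultdict(set)  # token -> ids of cells containing it
        self.vectors = {}                 # cell id -> original feature vector

    def add(self, cell_id, vector):
        self.vectors[cell_id] = list(vector)
        for token in quantize(vector, self.step):
            self.postings[token].add(cell_id)

    def search(self, query_vector, top_k=3):
        # Stage 1: collect every cell that shares at least one token
        # with the query, using only the postings lists.
        candidates = set()
        for token in quantize(query_vector, self.step):
            candidates |= self.postings.get(token, set())
        # Stage 2: rank the candidates by Euclidean distance to the query
        # (ties broken by cell id so the order is deterministic).
        ranked = sorted(
            candidates,
            key=lambda cid: (math.dist(query_vector, self.vectors[cid]), cid),
        )
        return ranked[:top_k]


# Hypothetical cells; in the framework the query vector would come from
# the rendered image of the query text.
index = InvertedVectorIndex()
index.add("cell_smith", [0.10, 0.80, 0.30])
index.add("cell_smyth", [0.12, 0.78, 0.40])
index.add("cell_jones", [0.90, 0.10, 0.70])

results = index.search([0.11, 0.79, 0.31])
print(results)
```

In this sketch the postings lookup never touches vectors that share no token with the query, which is the property that makes an inverted index attractive for large collections. The trade-off is that a coarse or badly aligned quantization can exclude a true match before the distance-based ranking ever sees it.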