Locating and Recognizing Text in WWW Images

  • Authors:
  • Daniel Lopresti; Jiangying Zhou

  • Affiliations:
  • Bell Laboratories, Lucent Technologies, Inc., 600 Mountain Avenue, Murray Hill, NJ 07974, USA. dpl@research.bell-labs.com; Summus Ltd., Suite 2200, 2000 Center Point Drive, Columbia, SC 29210, USA. jiangying@summus.com

  • Venue:
  • Information Retrieval
  • Year:
  • 2000

Abstract

The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a promising procedure based on clustering in color space followed by connected-components analysis. For character recognition, we discuss techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers. We also present the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.
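
To make the text-location idea concrete, the following is a minimal sketch of "cluster in color space, then run connected-components analysis". The abstract does not specify the clustering algorithm, so a simple greedy leader clustering with a fixed RGB distance threshold is assumed here as a stand-in. The function names, the threshold value, and the toy image are all hypothetical and are not taken from the paper.

```python
from collections import deque

def cluster_colors(image, threshold=60.0):
    """Greedy leader clustering (an assumed stand-in, not the paper's method).

    Each pixel joins the first cluster center within `threshold`
    (Euclidean distance in RGB); otherwise it founds a new cluster.
    """
    centers, labels = [], []
    for row in image:
        label_row = []
        for pixel in row:
            for idx, center in enumerate(centers):
                dist = sum((a - b) ** 2 for a, b in zip(pixel, center)) ** 0.5
                if dist <= threshold:
                    label_row.append(idx)
                    break
            else:  # no existing center was close enough
                centers.append(pixel)
                label_row.append(len(centers) - 1)
        labels.append(label_row)
    return labels, centers

def connected_components(labels):
    """Group 4-connected pixels that share a cluster label.

    Returns a list of (cluster_label, [(row, col), ...]) tuples.
    """
    height = len(labels)
    width = len(labels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if seen[r][c]:
                continue
            target = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and labels[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((target, pixels))
    return components

# Toy image: white background with two slightly different red vertical strokes.
W, R, R2 = (255, 255, 255), (200, 0, 0), (210, 10, 5)
image = [
    [W, R, W, R2, W],
    [W, R, W, R2, W],
    [W, W, W, W, W],
]
labels, centers = cluster_colors(image)
components = connected_components(labels)
# Treat the first cluster (label 0, the top-left pixel's color) as background.
candidates = [pixels for label, pixels in components if label != 0]
```

In this toy case the two reds fall into one color cluster, and the connected-components pass then separates them into two stroke-like regions. A real system would still need to filter such candidate regions by size, shape, and layout before passing them to a character recognizer.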