Information retrieval and OCR: from converting content to grasping meaning

  • Authors:
  • Jamie Callan; Paul Kantor; David Grossman

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA; Rutgers University, New Brunswick, NJ; Illinois Institute of Technology, Chicago, IL

  • Venue:
  • ACM SIGIR Forum
  • Year:
  • 2002

Abstract

IR and OCR have largely developed independent standards and metrics, with OCR focused on literal accuracy and IR focused on essential "content/meaning". With more and more media appearing not only on paper but also in multiple image formats, the opportunities and challenges for OCR on new formats (video and still images) are enormous. While OCR is assessed with metrics that emphasize words and characters, IR has learned to apply end-to-end metrics that ask whether the needs of users can be met by existing systems. The same considerations also apply to the problem of providing permanent worldwide access to millions of pages of legacy print documents, representing the shared human record as it existed until just a few years ago.
The International Society for Optical Engineering (SPIE) has held a series of Document Recognition and Retrieval (DRR) conferences. The tenth, DRR X, will be held in January 2003 in Santa Clara, California. In 2001, Dan LoPresti of Bell Labs decided that the area would benefit from more intense collaboration between those who specialize in finding the words on a page image and those who know how to find the right documents, given the words. He invited Paul Kantor (Rutgers) to join the DRR Chairs, and together they invited Dave Lewis (Consultant) to give a keynote address at DRR VIII. Dan then stepped down. Paul chaired DRR IX (2002) and then handed the reins to Tapas Kanungo (IBM, Almaden); together they invited Jamie Callan (CMU), David Grossman (IIT), and Alex Hauptmann (CMU) to join the conference committee for DRR X. To improve communication between SIGIR and DRR, this group proposed a SIGIR workshop on this area.
The workshop on "Information Retrieval and OCR: From Converting Content to Grasping Meaning" was intended to stimulate cross-fertilization between OCR and IR, in the hope that better use of IR will enable the OCR community to avoid expensive hand processing. It also aimed to demonstrate that combining current static and dynamic image processing with current state-of-the-art robust information retrieval can generate substantial advances in both the extraction of messages from image streams and the conversion of existing paper variants. It solicited papers dealing with future applications, such as the indexing and retrieval of text embedded in static or video graphic images, with problems of skew, distortion, and obscuration, as well as state-of-the-art discussions of the storage and retrieval of handwritten or print legacy materials.
The workshop was held on August 15, 2002, in Tampere, Finland, immediately following the SIGIR 2002 conference. Although the workshop was intended to appeal to a wide range of IR and OCR researchers (and indeed was proposed at the request of colleagues from the OCR community), it primarily drew people with a background in IR. About a dozen people participated. The small size allowed a very interactive, seminar-style format and vigorous discussion between and during presentations. Most presentations ran 30% to 50% longer than planned, and our impression is that most of the participants found it very productive.