A General System for the Retrieval of Document Images from Digital Libraries

  • Authors:
  • Simone Marinai; Emanuele Marino; Francesca Cesarini; Giovanni Soda

  • Venue:
  • Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL '04)
  • Year:
  • 2004

Abstract

Large collections of scanned documents (books and journals) are now available in Digital Libraries. The most common method for retrieving relevant information from these collections is image browsing, but this approach is not feasible for books with more than a few dozen pages. Printed text in the images can be recognized by OCR systems, which makes retrieval by textual content possible. However, the results depend heavily on the quality of the original documents. More sophisticated navigation is possible when an electronic table of contents of the book is available, with links to the corresponding pages. An opposite approach reduces the amount of symbolic information that must be extracted at storage time. This is the approach taken by document image retrieval systems.

In this paper we describe a system that we developed to retrieve information from digitized books and journals belonging to Digital Libraries. The main feature of the system is its ability to combine two principal retrieval strategies in several ways. The first strategy allows a user to find pages with a layout similar to that of a query page. The second strategy retrieves words in the collection that match a user-defined query, without performing OCR. Combining these basic strategies allows users to retrieve meaningful pages with little effort during the indexing phase. We describe the basic tools used in the system (layout analysis, layout retrieval, word retrieval) and the integration of these tools for answering complex queries. Experiments carried out on 1287 pages show the effectiveness of the integrated retrieval.
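
The abstract does not specify how the two strategies are combined, so the following is only a minimal sketch of one possible combination: pages are first filtered by an image-based word match, then ranked by layout similarity to a query page. Every name here (`Page`, `layout`, `word_hits`, `combined_search`, `min_word_score`), the use of a fixed-length layout vector, and the choice of cosine similarity are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List


@dataclass
class Page:
    page_id: int
    # Hypothetical fixed-length layout descriptor produced by layout analysis.
    layout: List[float]
    # Hypothetical word-image match scores in [0, 1], computed without OCR.
    word_hits: Dict[str, float] = field(default_factory=dict)


def layout_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two layout vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_search(
    pages: List[Page],
    query_layout: List[float],
    query_word: str,
    min_word_score: float = 0.5,
) -> List[Page]:
    """Keep pages whose word-match score passes the threshold,
    then rank the survivors by layout similarity to the query page."""
    candidates = [
        p for p in pages if p.word_hits.get(query_word, 0.0) >= min_word_score
    ]
    return sorted(
        candidates,
        key=lambda p: layout_similarity(p.layout, query_layout),
        reverse=True,
    )


pages = [
    Page(1, [1.0, 0.0], {"library": 0.9}),
    Page(2, [0.0, 1.0], {"library": 0.8}),
    Page(3, [1.0, 0.0], {}),
]
result = combined_search(pages, [1.0, 0.0], "library")
result_ids = [p.page_id for p in result]
```

Filtering before ranking is just one ordering; ranking by layout first and then checking word matches on the top results, or merging the two scores into a single weighted value, would be equally plausible readings of "combining in several ways".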