Retrieval by Layout Similarity of Documents Represented with MXY Trees

  • Authors:
  • Francesca Cesarini; Simone Marinai; Giovanni Soda

  • Venue:
  • DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
  • Year:
  • 2002

Abstract

Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by a tree-based representation, the Modified X-Y (MXY) tree. Each page in the database is represented by a feature vector containing both global features of the page and a vector representation of its layout derived from the corresponding MXY tree. To compute similarity, occurrences of tree patterns are handled like index terms in Information Retrieval. When retrieving relevant documents, the images in the collection are sorted by a measure that combines two values: the similarity of global features and the similarity of tree-pattern occurrences. The system is applied to the retrieval of documents belonging to digital libraries. It is tested on two data sets: more than 600 pages from a 19th-century journal, and a collection of monographs printed in the same century that also contains more than 600 pages.
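
The ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the weighting scheme, the similarity functions, or how the two values are combined, so the sketch assumes TF-IDF weights for tree patterns, cosine similarity between the weighted vectors, a distance-based similarity for global features, and a linear combination with a hypothetical weight `alpha`. All function names and the page dictionary layout are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(pattern_counts):
    """Weight tree-pattern occurrences like index terms (assumed TF-IDF scheme).

    pattern_counts: one Counter per page, mapping a tree-pattern id to the
    number of times that pattern occurs in the page's MXY tree.
    """
    n_pages = len(pattern_counts)
    doc_freq = Counter()
    for counts in pattern_counts:
        doc_freq.update(counts.keys())  # each page counts once per pattern
    vectors = []
    for counts in pattern_counts:
        vectors.append(
            {p: tf * (1.0 + math.log(n_pages / doc_freq[p]))
             for p, tf in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(p, 0.0) for p, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def global_similarity(g1, g2):
    """Map Euclidean distance between global features into (0, 1] (assumption)."""
    return 1.0 / (1.0 + math.dist(g1, g2))


def rank_pages(query_index, pages, alpha=0.5):
    """Sort all other pages by combined similarity to the query page.

    pages: list of dicts with keys "global" (list of floats) and
    "patterns" (Counter of tree-pattern occurrences). Both keys and the
    linear combination with alpha are assumptions made for illustration.
    """
    vectors = tfidf_vectors([page["patterns"] for page in pages])
    query = pages[query_index]
    scored = []
    for i, page in enumerate(pages):
        if i == query_index:
            continue
        layout_sim = cosine(vectors[query_index], vectors[i])
        global_sim = global_similarity(query["global"], page["global"])
        scored.append((i, alpha * global_sim + (1.0 - alpha) * layout_sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_pages(0, pages)` returns `(page_index, score)` pairs in decreasing order of similarity to page 0. Varying `alpha` shifts the balance between global features and tree-pattern occurrences.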