Page Classification for Meta-data Extraction from Digital Collections

Authors:
Francesca Cesarini;Marco Lastri;Simone Marinai;Giovanni Soda
Venue:
DEXA '01 Proceedings of the 12th International Conference on Database and Expert Systems Applications
Year:
2001


Abstract

Automatic extraction of meta-data from collections of scanned documents (books and journals) makes these digital collections more accessible. Classifying the page layout into a set of pre-defined classes can improve that extraction. In this paper we describe a method for classifying document images on the basis of their physical layout, which is described by means of a hierarchical representation: the Modified X-Y tree. The Modified X-Y tree describes a document through a recursive segmentation that alternates horizontal and vertical cuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built from a symbolic description of the document rather than directly from the image. The tree is then encoded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries; examples of classes taken into account for a journal are "title page", "index", and "regular page". The system is tested on a data-set of more than 600 pages belonging to a 19th-century journal.
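
The encoding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which tree-patterns or node labels are used, so the label names (`HS`, `VS`, `HL`, `VL`, `text`, `image`, `line`), the `Node` class, and the choice of parent-child label pairs as the "tree-patterns" are all assumptions made for illustration. The neural network classifier is left out; the sketch only shows how a tree of any size can be mapped to a vector of fixed length.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical labels (not taken from the paper).
HS, VS = "HS", "VS"   # internal node: horizontal / vertical cut along a space
HL, VL = "HL", "VL"   # internal node: horizontal / vertical cut along a line
TEXT, IMAGE, LINE = "text", "image", "line"   # leaf labels


@dataclass
class Node:
    """One node of a simplified X-Y tree: a separator or a leaf region."""
    label: str
    children: List["Node"] = field(default_factory=list)


def parent_child_patterns(root: Node) -> Counter:
    """Count (parent label, child label) pairs over the whole tree.

    Assumption: these pairs stand in for the paper's tree-patterns.
    """
    counts: Counter = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            counts[(node.label, child.label)] += 1
            stack.append(child)
    return counts


def encode(root: Node, vocabulary: List[Tuple[str, str]]) -> List[int]:
    """Map a tree to a fixed-size vector: one occurrence count per pattern."""
    counts = parent_child_patterns(root)
    return [counts.get(pattern, 0) for pattern in vocabulary]


# A made-up page: a heading, a two-column body, and a picture at the bottom.
page = Node(HS, [
    Node(TEXT),
    Node(VS, [Node(TEXT), Node(TEXT)]),
    Node(IMAGE),
])

# The vocabulary fixes the vector length regardless of the tree's size.
vocabulary = [(HS, TEXT), (HS, VS), (HS, IMAGE), (VS, TEXT), (VS, IMAGE)]

vector = encode(page, vocabulary)
print(vector)  # [1, 1, 1, 2, 0]
```

Because the vocabulary is fixed in advance, pages with very different trees all produce vectors of the same length, which is what allows them to be fed to a classifier with a fixed number of inputs.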