A Data Mining Approach to Reading Order Detection

  • Authors:
  • M. Ceci;M. Berardi;G. Porcelli;D. Malerba

  • Affiliations:
  • University of Bari;University of Bari;University of Bari;University of Bari

  • Venue:
  • ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Determining the reading order for layout components ex- tracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from texts associated to multiple layout components and makes both information extraction and content-based retrieval of documents more effective. A common aspect for all methods reported in the literature is that they strongly depend on the specific domain and are scarcely reusable when the classes of documents or the task at hand changes. In this paper, we investigate the prob- lem of detecting the reading order of layout components by resorting to a data mining approach which acquires the do- main specific knowledge from a set of training examples. The input of the learning method is the description of the "chains" of layout components defined by the user. Only spatial information is exploited to describe a chain, thus making the proposed approach also applicable to the cases in which no text can be associated to a layout component. The method induces a probabilistic classifier based on the Bayesian framework which is used for reconstructing either single or multiple chains of layout components. It has been evaluated on a set of document images.