Unsupervised Newspaper Segmentation Using Language Context

  • Authors:
  • R. Furmaniak

  • Affiliations:
  • University of Waterloo

  • Venue:
  • ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
  • Year:
  • 2007

Abstract

There has been increased interest in the digitization of newspaper archives. A major problem that must be solved is the high-accuracy decomposition of the page into its logical structure. In this paper, I present an approach that uses a language similarity measure based on OCR results to train geometric layout rules tailored to an arbitrary title. Experiments have shown this approach to be very effective.
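
The abstract does not specify which language similarity measure is used. As a rough illustration only, the sketch below assumes a bag-of-words cosine similarity between the OCR text of two page blocks. The function names `tokenize` and `cosine_similarity` are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the OCR text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two text blocks.

    This is an assumed stand-in for a language similarity measure,
    not the measure defined in the paper.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both blocks contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up OCR snippets for three neighbouring blocks on a page.
    block_a = "The council voted on the new harbour budget last night"
    block_b = "Members of the council said the harbour budget was overdue"
    block_c = "Fresh strawberries for sale at the market every Saturday"
    print(cosine_similarity(block_a, block_b))
    print(cosine_similarity(block_a, block_c))
```

Under this assumption, neighbouring blocks whose text scores high would be candidates for belonging to the same article, and such pairs could in turn serve as training evidence for geometric layout rules. How the paper actually derives its rules from the similarity scores is not described in the abstract.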