Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres

  • Authors:
  • Robert B. Allen;Ilya Waldstein;Weizhong Zhu

  • Affiliations:
  • College of Information Science and Technology, Drexel University;College of Information Science and Technology, Drexel University;College of Information Science and Technology, Drexel University

  • Venue:
  • ICADL '08 Proceedings of the 11th International Conference on Asian Digital Libraries: Universal and Ubiquitous Access to Information
  • Year:
  • 2008

Abstract

Many historical newspapers are being digitized. We aim to support access to them through text analysis of the OCR'd content. However, the OCR output contains many errors, so extracting meaningful content from it is difficult. We propose a pipeline of processing steps and describe the first two here: segmentation and genre identification. The segmentation procedure, which is based on headings, was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques that may improve accuracy still further.