Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres

Authors:
Robert B. Allen;Ilya Waldstein;Weizhong Zhu
Affiliations:
College of Information Science and Technology, Drexel University;College of Information Science and Technology, Drexel University;College of Information Science and Technology, Drexel University
Venue:
ICADL '08 Proceedings of the 11th International Conference on Asian Digital Libraries: Universal and Ubiquitous Access to Information
Year:
2008

Abstract

Many historical newspapers are being digitized. We aim to support access to them through text analysis of the OCR'd content. However, the OCR output contains many errors, so extracting meaningful content from it is difficult. We propose a pipeline of processing steps and describe the first two here: segmentation and genre identification. The segmentation procedure, which is based on headings, was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques that may improve accuracy still further.
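
As a rough illustration of the two steps named in the abstract, the following minimal sketch splits OCR text into segments at heading-like lines and then assigns a genre from cue words. It is not the authors' procedure: the heading heuristic (a short line that is mostly upper case), the 0.8 threshold, the cue-word lists, and the names `is_heading`, `segment`, `guess_genre`, and `GENRE_CUES` are all assumptions made for this example.

```python
import re

# Hypothetical cue words per genre; illustrative assumptions, not from the paper.
GENRE_CUES = {
    "weather": {"weather", "forecast", "temperature", "rain", "cloudy", "winds"},
    "obituary": {"died", "funeral", "survived", "burial"},
}


def is_heading(line):
    """Assumed heuristic: a short line whose letters are mostly upper case.

    The 0.8 ratio leaves some room for OCR errors inside a heading.
    """
    letters = [c for c in line if c.isalpha()]
    if not letters or len(line.split()) > 8:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= 0.8


def segment(text):
    """Split text into segments, starting a new one at each heading-like line."""
    segments = []
    current = {"heading": None, "body": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if is_heading(line):
            # Close the previous segment if it holds anything.
            if current["heading"] is not None or current["body"]:
                segments.append(current)
            current = {"heading": line, "body": []}
        else:
            current["body"].append(line)
    if current["heading"] is not None or current["body"]:
        segments.append(current)
    return segments


def guess_genre(seg):
    """Return the genre with the most cue words in the segment, else 'unknown'."""
    text = " ".join([seg["heading"] or ""] + seg["body"]).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    best, best_hits = "unknown", 0
    for genre, cues in GENRE_CUES.items():
        hits = len(tokens & cues)
        if hits > best_hits:
            best, best_hits = genre, hits
    return best


sample = (
    "THE WEATHER\n"
    "Fair tonight and Tuesday; rising temperature.\n"
    "Light northerly winds.\n"
    "\n"
    "JOHN SMITH DEAD\n"
    "Mr. Smith died at his home. The funeral is Friday.\n"
)

for seg in segment(sample):
    print(seg["heading"], "->", guess_genre(seg))
```

A cue-word lookup of this kind only suits narrowly defined genres with a stable vocabulary, which matches the abstract's observation that easily defined categories such as weather reports worked well; broader genres would likely need a trained classifier.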