ICADL'10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries
Many historical newspapers are being digitized. We aim to support access to them through text analysis of the OCR'd content. However, the OCR output contains many errors, so extracting meaningful content from it is difficult. We propose a pipeline of processing steps and describe the first two here: segmentation and genre identification. The heading-based segmentation procedure was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose additional techniques that may improve accuracy still further.
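
The abstract does not specify how headings are detected or how genres are assigned, so the following is only an illustrative sketch of the two steps it names, not the authors' method. It assumes, hypothetically, that a heading is a short line consisting mostly of upper-case letters (with some tolerance for OCR noise) and that a weather report can be recognized by a small keyword list. The names `is_heading`, `segment_by_headings`, `guess_genre`, and `WEATHER_TERMS`, along with all thresholds, are invented for this example.

```python
import re

# Assumption: a heading is a short line whose letters are mostly upper-case.
# The 0.8 ratio leaves room for a few OCR-damaged characters.
def is_heading(line, max_words=8, min_upper_ratio=0.8):
    letters = [c for c in line if c.isalpha()]
    if not letters or len(line.split()) > max_words:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= min_upper_ratio

def segment_by_headings(text):
    """Split OCR text into segments, starting a new one at each heading."""
    segments = []
    current = {"heading": None, "body": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if is_heading(line):
            # Close the previous segment if it holds anything.
            if current["heading"] is not None or current["body"]:
                segments.append(current)
            current = {"heading": line, "body": []}
        else:
            current["body"].append(line)
    if current["heading"] is not None or current["body"]:
        segments.append(current)
    return segments

# Hypothetical keyword list for one easily defined genre.
WEATHER_TERMS = {"weather", "rain", "cloudy", "temperature",
                 "winds", "fair", "showers"}

def guess_genre(segment, min_hits=2):
    """Label a segment 'weather' if enough distinct keywords occur in it."""
    text = " ".join([segment["heading"] or ""] + segment["body"]).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    hits = len(tokens & WEATHER_TERMS)
    return "weather" if hits >= min_hits else "other"

sample = """THE WEATHER
Fair tonight and Tuesday; light winds.
Rain expected by Wedncsday.
CITY COUNCIL MEETS
The council met last evening to discuss the new bridge."""

segments = segment_by_headings(sample)
genres = [guess_genre(s) for s in segments]
```

Counting distinct keywords rather than requiring an exact phrase keeps the rule usable when individual words are misrecognized, as with "Wedncsday" in the sample. A genre with less distinctive vocabulary would likely need a trained classifier instead.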