Writing documents for paper and WWW: a strategy based on FrameMaker and WebMaker
Selected papers of the first conference on World-Wide Web
On lexical resources for digitization of historical documents
Proceedings of the 9th ACM symposium on Document engineering
Leveraging back-of-the-book indices to enable spatial browsing of a historical document collection
Proceedings of the 6th Workshop on Geographic Information Retrieval
LaTeCH '12 Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Hi-index | 0.00 |
Many large-scale digitization projects are currently under way that intend to preserve the cultural heritage contained in paper documents (in particular books) and make it available on the Web. Typically OCR is used to produce searchable electronic texts from books. For newer books, approximately from the late 1980s onwards,digital text may already exist in the form of typesetting data. For applications that require a higher level of accuracy than OCR can deliver, the conversion of typesetting data can thus be an alternative to manual keying. In this paper, we describe a tool for converting typesetting data in FrameMaker format to XHTML+CSS developed for a collection of source editions of medieval and early modern documents. Even though the books of the Collection are typeset in good quality and in modern typefaces, OCR is unusable,since the text is in various historical forms of German, French,Italian, Rhaeto-Romanic, and Latin. The conversion of typesetting data produces fully reliable text free from OCR errors and thus also provides a basis for the construction of language resources for the processing of historical texts.