Document conversion for cultural heritage texts: FrameMaker to HTML revisited

Authors:
Michael Piotrowski
Affiliations:
Law Sources Foundation of the Swiss Lawyers Society, Zurich, Switzerland
Venue:
Proceedings of the 10th ACM symposium on Document engineering
Year:
2010

Abstract

Many large-scale digitization projects are currently under way that intend to preserve the cultural heritage contained in paper documents (in particular books) and make it available on the Web. Typically, OCR is used to produce searchable electronic texts from books. For newer books, approximately from the late 1980s onwards, digital text may already exist in the form of typesetting data. For applications that require a higher level of accuracy than OCR can deliver, the conversion of typesetting data can thus be an alternative to manual keying. In this paper, we describe a tool for converting typesetting data in FrameMaker format to XHTML+CSS, developed for a collection of source editions of medieval and early modern documents. Even though the books of the Collection are typeset in good quality and in modern typefaces, OCR is unusable, since the text is in various historical forms of German, French, Italian, Rhaeto-Romanic, and Latin. The conversion of typesetting data produces fully reliable text free from OCR errors and thus also provides a basis for the construction of language resources for the processing of historical texts.