Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain

Authors:
Guido Sautter;Klemens Böhm;Donat Agosti;Christiana Klingenberg
Affiliations:
Universität Karlsruhe (TH), Karlsruhe 76128;Universität Karlsruhe (TH), Karlsruhe 76128;Am. Mus. of Nat. Hist., New York NY 10024-5192;Staatliches Museum für Naturkunde, Karlsruhe, 76133
Venue:
ESWC 2009 Heraklion Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications
Year:
2009

Abstract

Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world they describe. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand, and manual cleaning and correction on the other, should be tightly interleaved, and that tools supporting this integration yield a significant improvement.
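
To illustrate the first point of the abstract, the following is a minimal sketch of how XML markup might be turned into RDF statements, serialized here as N-Triples lines. It is not the authors' tooling: the element names (`document`, `treatment`, `taxonName`, `location`), the sample content, and the `http://example.org/` namespace are all hypothetical, chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, attributes, and content are
# illustrative assumptions, not the schema used in the paper.
SAMPLE = """<document id="doc1">
  <treatment id="t1">
    <taxonName rank="species">Atta cephalotes</taxonName>
    <location>Costa Rica</location>
  </treatment>
</document>"""

BASE = "http://example.org/"  # placeholder namespace (assumption)


def literal(text):
    """Quote a string as a plain N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def treatment_triples(xml_text):
    """Emit one triple per non-empty child element of each treatment."""
    root = ET.fromstring(xml_text)
    triples = []
    for treatment in root.iter("treatment"):
        subject = f"<{BASE}treatment/{treatment.get('id')}>"
        for child in treatment:
            text = (child.text or "").strip()
            if not text:
                continue  # skip elements without usable text
            predicate = f"<{BASE}vocab/{child.tag}>"
            triples.append(f"{subject} {predicate} {literal(text)} .")
    return triples


if __name__ == "__main__":
    for line in treatment_triples(SAMPLE):
        print(line)
```

The sketch also shows why markup quality matters: an OCR error inside a `taxonName` element, or a tag that is opened in the wrong place, would pass straight into the generated statements.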