Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world they describe. A prerequisite is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand, and manual cleaning and correction on the other, should be tightly interleaved, and that tools supporting this integration yield a significant improvement.