Mapping physical formats to logical models to extract data and metadata: the Defuddle parsing engine

Authors:
Tara D. Talbott; Karen L. Schuchardt; Eric G. Stephan; James D. Myers
Affiliations:
Pacific Northwest National Laboratory, Richland, WA; Pacific Northwest National Laboratory, Richland, WA; Pacific Northwest National Laboratory, Richland, WA; National Center for Supercomputing Applications, Urbana, IL
Venue:
IPAW'06 Proceedings of the 2006 international conference on Provenance and Annotation of Data
Year:
2006

Citing 2
Cited 1

Enabling massive scale document transformation for the semantic web: the universal parsing agent™

Proceedings of the 2005 ACM symposium on Document engineering

Adapting the electronic laboratory notebook for the semantic era

CTS'05 Proceedings of the 2005 international conference on Collaborative technologies and systems

Provenance collection support in the Kepler scientific workflow system

IPAW'06 Proceedings of the 2006 international conference on Provenance and Annotation of Data

Abstract

Scientists, motivated by the desire for systems-level understanding of phenomena, increasingly need to share their results across multiple disciplines. Accomplishing this requires data to be annotated, contextualized, readily searchable, and translatable into other formats. While these requirements can be addressed by custom programming or obviated by community standardization, neither approach has ‘solved’ the problem. In this paper, we describe a complementary approach: a general capability for articulating the format of arbitrary textual and binary data using a logical data model, expressed in XML Schema, which can be used to provide annotation and context, extract metadata, and enable translation. This work is based on the draft specification for the Data Format Description Language and our open-source “Defuddle” parser. We present an overview of the specification, detail the design of Defuddle, and discuss the benefits and challenges of this general approach to enabling discovery, sharing, and interpretation of diverse data sets.