XStruct: Efficient Schema Extraction from Multiple and Large XML Documents

Authors:
Jan Hegewald;Felix Naumann;Melanie Weis
Affiliations:
Humboldt-Universität zu Berlin;Humboldt-Universität zu Berlin;Humboldt-Universität zu Berlin
Venue:
ICDEW '06 Proceedings of the 22nd International Conference on Data Engineering Workshops
Year:
2006

Citing 0
Cited 10

Inferring XML schema definitions from XML data
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Learning deterministic regular expressions for the inference of schemas from XML data
Proceedings of the 17th international conference on World Wide Web

Output schemas of XSLT stylesheets and their applications
Information Sciences: an International Journal

Inference of concise regular expressions and DTDs
ACM Transactions on Database Systems (TODS)

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
ACM Transactions on the Web (TWEB)

Recovering data semantics from XML documents into DTD graph with SAX
ACOS'06 Proceedings of the 5th WSEAS international conference on Applied computer science

Dealing with large schema sets in mobile SOS-based applications
Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications

Instance-based XML data binding for mobile devices
Proceedings of the Third International Workshop on Middleware for Pervasive Mobile and Embedded Computing

An unsupervised approach for acquiring ontologies and RDF data from online life science databases
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II

User profile integration made easy: model-driven extraction and transformation of social network schemas
Proceedings of the 21st international conference companion on World Wide Web


Abstract

XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence, much XML data does not have a schema or is not accompanied by its schema. To gain the benefits of having a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), this schema must be extracted. In this paper we present an automatic technique, XStruct, for XML Schema extraction. Based on ideas of [5], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element's contents correctly but generalized to a reasonable degree. Our approach has several advantages over known techniques: XStruct scales to very large documents (beyond 1 GB) in both time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; and it detects datatypes and attributes. Experiments confirm these features and properties.
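
To make the kind of input such an extractor works from more concrete, the following is a minimal sketch, not XStruct's actual implementation. It streams over several XML documents and records, per element name, the observed child-name sequences, the attribute names, and a rough datatype guess for text content. The function names `collect_structure` and `infer_datatype` and the three-way type split (integer, decimal, string) are assumptions made for illustration; the heuristics that generalize the collected sequences into 1-unambiguous regular expressions are not reproduced here.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def infer_datatype(values):
    """Pick the narrowest simple type that fits every observed text value.

    Hypothetical helper: a crude stand-in for datatype detection.
    """
    def all_match(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_match(int):
        return "integer"
    if all_match(float):
        return "decimal"
    return "string"


def collect_structure(sources):
    """Stream over several XML sources and record, per element name,
    the observed child-name sequences, attribute names, and text values."""
    child_seqs = defaultdict(set)
    attributes = defaultdict(set)
    texts = defaultdict(list)
    for source in sources:
        stack = []  # one child-name list per currently open element
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                if stack:
                    stack[-1].append(elem.tag)  # register with the parent
                stack.append([])
                attributes[elem.tag].update(elem.attrib)  # adds the keys
            else:
                child_seqs[elem.tag].add(tuple(stack.pop()))
                text = (elem.text or "").strip()
                if text:
                    texts[elem.tag].append(text)
                elem.clear()  # drop the processed subtree to limit memory
    return child_seqs, attributes, texts


if __name__ == "__main__":
    docs = [
        io.BytesIO(b"<lib><book id='1'><title>A</title><year>2001</year></book></lib>"),
        io.BytesIO(b"<lib><book><title>B</title></book></lib>"),
    ]
    seqs, attrs, texts = collect_structure(docs)
    for tag in sorted(seqs):
        print(tag, sorted(seqs[tag]), sorted(attrs[tag]), infer_datatype(texts[tag]))
```

Using `iterparse` and clearing each element after its end event keeps memory use roughly proportional to document depth rather than document size, which is one plausible way to approach the scalability goal stated above. A real extractor would then generalize each set of child sequences into a content model.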