Towards inference of more realistic XSDs

Authors:
Irena Mlýnková;Martin Nečaský
Affiliations:
Charles University in Prague, Czech Republic;Charles University in Prague, Czech Republic
Venue:
Proceedings of the 2009 ACM symposium on Applied Computing
Year:
2009

Abstract

XML has undoubtedly become a standard for data representation and manipulation. However, most XML documents are still created without a description of their structure, i.e. an XML schema. Hence, in this paper we focus on the problem of automatically inferring an XML schema from a given sample set of XML documents. In contrast to existing works, which aim to infer as concise a schema as possible, we focus on inferring a more realistic result, i.e. a schema that is closer to human-written ones and carries more precise information. For this purpose, we extend and combine existing verified techniques (such as the ACO heuristic or the MDL principle) with a set of heuristics that exploit the semantics of element/attribute names, thesauri, and statistical analysis of the input data. Using a set of examples, we show and discuss the advantages of our proposal.
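To make the idea of statistical analysis of input data concrete, the following minimal Python sketch derives occurrence bounds for child elements from a sample set of XML documents. It is an illustration only, not the authors' algorithm: the function name `infer_occurrences` and the sample documents are hypothetical, and the sketch ignores element order, attributes, and the ACO and MDL steps mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def infer_occurrences(documents):
    """Return {parent_tag: {child_tag: (min_count, max_count)}}.

    Hypothetical helper: for every element name seen in the samples, it
    records how often each child name occurs per element instance and
    keeps the smallest and largest count observed.
    """
    # parent tag -> one Counter of child tags per observed element instance
    observed = defaultdict(list)
    for text in documents:
        root = ET.fromstring(text)
        for elem in root.iter():
            observed[elem.tag].append(Counter(child.tag for child in elem))

    bounds = {}
    for parent, instances in observed.items():
        # All child names that appear under this parent in any instance
        children = set().union(*instances)
        bounds[parent] = {
            child: (
                min(inst[child] for inst in instances),  # missing child counts as 0
                max(inst[child] for inst in instances),
            )
            for child in sorted(children)
        }
    return bounds


# Assumed sample data, for illustration only
docs = [
    "<person><name>A</name><email>a@x</email><email>b@x</email></person>",
    "<person><name>B</name></person>",
]
bounds = infer_occurrences(docs)
print(bounds["person"])
```

Under these assumptions, a minimum of 0 would map to an optional child (`minOccurs="0"`) and a maximum above 1 to a repeatable one. A realistic inference method would still have to decide whether a bound seen in a small sample should be generalized, for example to `maxOccurs="unbounded"`.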