Recovering data semantics from XML documents into DTD graph with SAX

  • Authors:
  • Herbert Shiu;Joseph Fong;Robert P. Biuk-Aghai

  • Affiliations:
  • Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong;Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong;Department of Computer and Information Science, University of Macau, Macau

  • Venue:
  • ACOS'06 Proceedings of the 5th WSEAS international conference on Applied computer science
  • Year:
  • 2006

Abstract

We propose a systematic approach to reverse engineer arbitrary XML documents into their conceptual schema, represented as DTD graphs. This is necessary because XML documents are frequently used to store structured data, yet their schemas, such as those in Document Type Definition (DTD) format, are often missing, especially for existing historical XML documents. As a result, it is difficult for software developers or end users to make use of them. Even when the schemas exist, they are difficult to read and do not make explicit the underlying relationships among the elements in the documents. In view of this, it is necessary to determine the data semantics from the XML documents themselves. If the DTDs of the XML documents exist and identify the ID/IDREF(S) type attributes, then more data semantics can be derived. Another application of the determined data semantics is to verify the linkages implemented by ID/IDREF(S): if an element refers to an incorrect XML element type, an extra data semantic will be determined as a result, and such findings can be used for verification purposes. Furthermore, the approaches proposed in this paper use the Simple API for XML (SAX), so the algorithms are applicable to XML documents ranging from small to very large.
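The abstract does not give the paper's algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It uses Python's standard `xml.sax` module to make one streaming pass over a document. During that pass it records the maximum number of times each child element occurs under a single parent instance, and which element types the ID/IDREF-style references point to. The class name `StructureHandler`, the helper `analyze`, the sample document, and the attribute names `id` and `ref` are all illustrative assumptions. In practice, the ID/IDREF(S) attribute names would come from a DTD when one is available.

```python
import xml.sax
from collections import Counter, defaultdict


class StructureHandler(xml.sax.ContentHandler):
    """Collects nesting and reference information in one SAX pass.

    The attribute names treated as ID / IDREF are assumptions supplied by
    the caller; without a DTD they cannot be known for certain.
    """

    def __init__(self, id_attrs=("id",), idref_attrs=("ref",)):
        super().__init__()
        self.id_attrs = set(id_attrs)
        self.idref_attrs = set(idref_attrs)
        self.stack = []                         # open elements: (name, child counts)
        self.max_occurs = defaultdict(Counter)  # parent -> child -> max count seen
        self.id_owner = {}                      # ID value -> owning element type
        self.refs = []                          # (referrer type, attribute, value)

    def startElement(self, name, attrs):
        if self.stack:
            # Count this element as a child of the currently open parent.
            self.stack[-1][1][name] += 1
        self.stack.append((name, Counter()))
        for attr, value in attrs.items():
            if attr in self.id_attrs:
                self.id_owner[value] = name
            elif attr in self.idref_attrs:
                self.refs.append((name, attr, value))

    def endElement(self, name):
        _, counts = self.stack.pop()
        for child, n in counts.items():
            # Keep the largest count observed under any one parent instance.
            self.max_occurs[name][child] = max(self.max_occurs[name][child], n)

    def reference_targets(self):
        """Map (referrer type, attribute) to the set of element types referenced.

        Resolved after parsing so that forward references are handled.
        An unresolved reference contributes None to the set.
        """
        targets = defaultdict(set)
        for name, attr, value in self.refs:
            targets[(name, attr)].add(self.id_owner.get(value))
        return dict(targets)


def analyze(xml_text):
    """Parse an XML string with SAX and return the populated handler."""
    handler = StructureHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


# Hypothetical sample: the second book refers to another book, not an author.
SAMPLE = """<library>
  <author id="a1"/>
  <book id="b1" ref="a1"/>
  <book id="b2" ref="b1"/>
</library>"""

if __name__ == "__main__":
    result = analyze(SAMPLE)
    print(dict(result.max_occurs))
    print(result.reference_targets())
```

In this sketch, a maximum child count above one suggests a one-to-many relationship between parent and child. A reference attribute that resolves to more than one element type is the kind of finding that could be flagged for verification. Because SAX delivers events as the document is read, memory use depends on the nesting depth and the number of ID values and references, not on the total document size.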