XML Data Management: Native XML and XML Enabled DataBase Systems

  • Authors:
  • A. Chaudhri;Roberto Zicari;Awais Rashid

  • Venue:
  • XML Data Management: Native XML and XML Enabled DataBase Systems
  • Year:
  • 2003

Abstract

From the Book:

The past few years have seen a dramatic increase in the popularity and adoption of XML: the eXtensible Markup Language. This explosive growth is driven by its ability to provide a standardized, extensible means of including semantic information within documents describing semi-structured data. This makes it possible to address the shortcomings of existing markup languages such as HTML and support data exchange in e-business environments.

Consider, for instance, the simple HTML document in Figure 1. The data contained in the document is intertwined with information about its presentation. In fact, the tags only describe how the data are to be formatted. There is no semantic information that the data represents a person's name and address. Consequently, an interpreter cannot make any sound judgments about the semantics as the tags could as well have enclosed information about a car and its parts. Systems such as WIRE (Aggarwal et al. 1998) can interpret the information by using search templates based on the structure of HTML files and the importance of information enclosed in tags defining headings, etc. However, such interpretation lacks soundness and its accuracy is context dependent. Dynamic web pages, where the data resides in a back-end database and is served using pre-defined templates, reduce the coupling between the data and its representation. However, the semantics of the data can still be confusing when exchanging information in an e-business environment. A particular item could be represented using different names (in the simplest case) in two systems in a business-to-business transaction. This enforces adherence to complex, often proprietary, document standards.

XML provides inherent support for addressing the above problems, as the data in an XML document is self-describing. However, the increasing adoption of XML has also raised new challenges. One of the key issues is the management of large collections of XML documents.
There is a need for tools and techniques for effective storage, retrieval and manipulation of XML data. The aim of this book is to discuss the state-of-the-art in such tools and techniques.

This chapter introduces the basics of XML and some related technologies before moving on to providing an overview of issues relating to XML data management and approaches addressing these issues. Only an overview of XML and related technologies is provided as there are several sources covering these concepts in depth.

What is XML?

XML is a W3C standard for document markup. It makes it possible to define custom tags describing the data enclosed by them. An example XML document containing data about a person is shown in Figure 2. Note that tags in XML can have attributes. However, for simplicity these have not been used in this example.

Unlike the HTML document in Figure 1, the document in Figure 2 contains only the data about the person and no representational information. The data and its meaning can be read from the document and formatted in a range of fashions as desired. One standard approach is to use XSL: the eXtensible Stylesheet Language.

The flexible nature of XML makes it an ideal basis for defining arbitrary languages. One such example is WML: the Wireless Markup Language. Similarly, the XML schema language used to describe the structure of XML documents is based on XML itself.

Well-Formed and Valid XML

Although XML syntax is flexible, it is constrained by a grammar that governs the permitted tag names, attachment of attributes to tags and so on. All XML documents must conform to these basic grammar rules. Such conformant documents are said to be well formed and can be interpreted by an XML interpreter. This avoids having to write an interpreter for each XML document instance.

In addition to being well formed, the structure of a particular XML document can be validated against a Document Type Definition (DTD) or an XML schema.
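
As a small illustration of well-formedness, the sketch below uses Python's standard xml.etree.ElementTree parser, which rejects any input that breaks the basic grammar rules. The person record and its values are assumptions made for this example, since the book's Figure 2 is not reproduced here; only the element names PERSON, NAME, FIRSTNAME and SURNAME are taken from the text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical person record; the values are invented for illustration.
GOOD = (
    "<PERSON><NAME><FIRSTNAME>John</FIRSTNAME>"
    "<SURNAME>Smith</SURNAME></NAME></PERSON>"
)

# Same data, but the NAME element is never closed, so the tags do not nest.
BAD = (
    "<PERSON><NAME><FIRSTNAME>John</FIRSTNAME>"
    "<SURNAME>Smith</SURNAME></PERSON>"
)

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

A check of this kind says nothing about whether the document follows a particular DTD or schema; it only confirms that a generic XML parser can read it.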
An XML document conforming to a given DTD or schema is said to be valid.

Data-Centric and Document-Centric XML

XML documents can be classified on the basis of the data they contain. Data-centric documents capture structured data such as that pertaining to a product catalog, order or invoice. Document-centric documents, on the other hand, capture unstructured data as in articles, books or emails. Of course, the two types can be combined to form hybrid documents that are both data-centric and document-centric. Figure 3 provides examples of data-centric and document-centric XML.

XML Concepts

DTDs and XML Schemas

Both DTDs and XML schemas are mechanisms used to define the structure of XML documents. They determine what elements can be contained within the XML document, how they are to be used, what default values their attributes can have and so on. Given a DTD or XML schema and its corresponding XML document, a parser can validate whether the document conforms to the desired structure and constraints. This is particularly useful in data exchange scenarios as DTDs and XML schemas provide and enforce a common vocabulary for the data to be exchanged.

XML DTDs are subsets of SGML (Standard Generalized Markup Language) DTDs. An XML DTD lists the various elements and attributes in a document and the context in which they are to be used. It can also list any elements a document cannot contain. However, it does not define constraints such as the number of instances of a particular element within a document, the data type of data within each element and so on. Consequently, DTDs are inherently more suitable for document-centric XML than for data-centric XML. This is because data typing and instantiation constraints are less critical in the former case. However, they can be and are being used for both types of documents.

Figure 4 shows a DTD for the simple XML document in Figure 2.
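
Because the figures are not reproduced here, the following is only a hedged sketch of what such a document and DTD might look like, not the book's actual Figure 2 or Figure 4. The element names PERSON, NAME, ADDRESS, FIRSTNAME and SURNAME come from the text; STREET, CITY, COUNTRY and all values are assumptions. The DTD is embedded as an internal subset and the document is read with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction. PERSON, NAME, ADDRESS, FIRSTNAME and SURNAME
# appear in the text; STREET, CITY, COUNTRY and the values are assumed.
DOCUMENT = """<!DOCTYPE PERSON [
  <!ELEMENT PERSON    (NAME, ADDRESS)>
  <!ELEMENT NAME      (FIRSTNAME, SURNAME)>
  <!ELEMENT ADDRESS   (STREET, CITY, COUNTRY)>
  <!ELEMENT FIRSTNAME (#PCDATA)>
  <!ELEMENT SURNAME   (#PCDATA)>
  <!ELEMENT STREET    (#PCDATA)>
  <!ELEMENT CITY      (#PCDATA)>
  <!ELEMENT COUNTRY   (#PCDATA)>
]>
<PERSON>
  <NAME>
    <FIRSTNAME>John</FIRSTNAME>
    <SURNAME>Smith</SURNAME>
  </NAME>
  <ADDRESS>
    <STREET>1 Main Street</STREET>
    <CITY>Lancaster</CITY>
    <COUNTRY>UK</COUNTRY>
  </ADDRESS>
</PERSON>"""

# ElementTree reads the DOCTYPE but does not validate against it;
# this parse only confirms that the document is well formed.
root = ET.fromstring(DOCUMENT)

# Composite elements have child elements; primitive ones hold only text.
composite = [el.tag for el in root.iter() if len(el) > 0]
primitive = [el.tag for el in root.iter() if len(el) == 0]

print(composite)
print(primitive)
```

Enforcing the declarations would require a validating parser, which the Python standard library does not provide.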
The DTD describes which primitive elements form valid components for the three composite ones: PERSON, NAME and ADDRESS. The keyword #PCDATA signifies that the element contains only parsed character data, with no tags or child elements.

XML schemas differ from DTDs in that the XML schema definition language is based on XML itself. As a result, unlike DTDs, the set of constructs available for defining an XML document is extensible. XML schemas also support namespaces and richer and more complex structures than DTDs. In addition, stronger typing constraints on the data enclosed by a tag can be described, because a range of primitive data types such as string, decimal and integer is supported. This makes XML schemas highly suitable for defining data-centric documents. Another significant advantage is that XML schema definitions can exploit the same data management mechanisms as designed for XML; an XML schema is an XML document itself. This is in direct contrast with DTDs, which require specific support to be built into an XML data management system.

Figure 5 shows an XML schema for the simple XML document in Figure 2. The sequence tag is a compositor indicating an ordered sequence of sub-elements. There are other compositors for choice and all. Also, note that, as shown for the ADDRESS element, it is possible to constrain the minimum and maximum instances of an element within a document. Although not shown in the example, it is possible to define custom complex and simple types. For instance, a complex type Address could have been defined for the address element.

DOM and SAX

DOM and SAX are the two main APIs for manipulating XML documents in an application. They are now part of the Java API for XML Processing (JAXP version 1.1). DOM is the W3C standard Document Object Model, an operating system and programming language independent model for storing and manipulating hierarchical documents in memory.
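
The difference between the two APIs can be sketched with Python's standard library, which ships a DOM implementation (xml.dom.minidom) and a SAX implementation (xml.sax). This is only an illustration: the text refers to the Java JAXP bindings, and the person record and the SurnameHandler class are assumptions made for this example. The DOM version builds the whole tree in memory before looking anything up, while the SAX version reacts to events as the parser reads the input.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical person record; the values are invented for illustration.
XML_TEXT = (
    "<PERSON><NAME><FIRSTNAME>John</FIRSTNAME>"
    "<SURNAME>Smith</SURNAME></NAME></PERSON>"
)

# DOM: the complete tree is built first, then traversed.
dom = xml.dom.minidom.parseString(XML_TEXT)
surname_node = dom.getElementsByTagName("SURNAME")[0]
dom_surname = surname_node.firstChild.data


# SAX: the parser calls back on each start tag, end tag and text chunk.
class SurnameHandler(xml.sax.ContentHandler):
    """Collects the character data found inside SURNAME elements."""

    def __init__(self):
        super().__init__()
        self.in_surname = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "SURNAME":
            self.in_surname = True

    def characters(self, content):
        if self.in_surname:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "SURNAME":
            self.in_surname = False


handler = SurnameHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_surname = "".join(handler.chunks)

print(dom_surname, sax_surname)
```

The SAX handler keeps only the state it needs, which is why the event-driven style tends to suit large inputs; the DOM version is shorter to write but holds every node in memory.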
A DOM parser parses an XML document and builds a DOM tree, which can then be used to traverse the various nodes. However, the tree has to be constructed before traversal can commence. As a result, memory management is an issue when manipulating large XML documents. This is highly resource intensive, especially in cases where only a small section of the document is to be manipulated.

SAX, the Simple API for XML, is a de facto standard. It differs from DOM in that it uses an event-driven model. Each time a starting or closing tag, or a processing instruction, is encountered the program is notified. As a result, the whole document does not need to be parsed before it is manipulated. In fact, sections of the document can be manipulated as they are parsed. Therefore, SAX is better suited than DOM to manipulating large documents.

XML-Related Technologies

XPath

XPath, the XML Path Language, provides common syntax and semantics for locating and linking to information contained within an XML document. Using XPath the information can be addressed in two ways:

  • A hierarchical fashion based on the ordering of elements in a document tree
  • An arbitrary manner relying on elements in a document tree having unique identifiers

A few example XPath expressions, based on the sample XML document in Figure 2, are shown in Figure 6. Example 1 selects all children named FIRSTNAME in the current focus element. Example 2 selects the child node SURNAME whose parent node is NAME within the current focus element, while example 3 tests whether an element is present in the union of the elements NAME and ADDRESS. Note that, although not shown in the examples, it is also possible to specify constraints such as first ADDRESS of the third PERSON in the document.

XSL

Since an XML document does not contain any representational information, it can be formatted in a flexible manner. A standard approach to formatting XML documents is using XSL, the eXtensible Stylesheet Language.
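
Since XSL stylesheets rely on XPath expressions to pick out the nodes they format, it may help to see the XPath addressing described in the previous subsection in executable form. The sketch below uses the limited XPath subset supported by Python's xml.etree.ElementTree. The PEOPLE wrapper element and all values are assumptions made for this example, and the expressions are plausible readings of the examples described in the text rather than a copy of Figure 6. The union test of example 3 is left out because ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the PEOPLE wrapper and all values are assumed.
PEOPLE = """
<PEOPLE>
  <PERSON>
    <NAME><FIRSTNAME>John</FIRSTNAME><SURNAME>Smith</SURNAME></NAME>
    <ADDRESS>1 Main Street</ADDRESS>
  </PERSON>
  <PERSON>
    <NAME><FIRSTNAME>Jane</FIRSTNAME><SURNAME>Doe</SURNAME></NAME>
    <ADDRESS>2 High Street</ADDRESS>
  </PERSON>
  <PERSON>
    <NAME><FIRSTNAME>Ann</FIRSTNAME><SURNAME>Lee</SURNAME></NAME>
    <ADDRESS>3 Park Road</ADDRESS>
    <ADDRESS>4 Lake View</ADDRESS>
  </PERSON>
</PEOPLE>
"""

root = ET.fromstring(PEOPLE)
person = root.find("PERSON")   # first PERSON, used as the focus element
name = person.find("NAME")

# All children named FIRSTNAME of the focus element (here: NAME).
first_names = [el.text for el in name.findall("FIRSTNAME")]

# The SURNAME child of NAME, relative to the focus element (here: PERSON).
surnames = [el.text for el in person.findall("NAME/SURNAME")]

# Positional constraint: first ADDRESS of the third PERSON in the document.
address = root.find("PERSON[3]/ADDRESS[1]").text

print(first_names, surnames, address)
```

Positions in XPath predicates count from 1, so PERSON[3] is the third PERSON child of the root, not the fourth.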
The W3C XSL specification is composed of two parts: XSL Formatting Objects (XSL FO) and XSL Transformations (XSLT).

XSL FO provides formatting and flow semantics for rendering an XML document. A rendering agent is responsible for interpreting the abstract constructs provided by XSL FO in order to instantiate the representation for a particular medium.

XSLT offers constructs to transform information from one organization to another. Although designed to transform an XML vocabulary to an XSL FO vocabulary, XSLT can be used for a range of transformations including those to HTML as shown in Figure 7. The example stylesheet uses a set of simple XSLT templates and XPath expressions to transform a part of the XML document in Figure 2 to HTML.

SOAP

SOAP is the Simple Object Access Protocol used to invoke code over the Internet using XML and HTTP. The mechanism is similar to Java Remote Method Invocation (RMI). In SOAP, method calls are converted to XML and transmitted over HTTP. SOAP was designed for compatibility with XML schemas, though their use is not mandatory. Being based on XML, XML schemas offer a seamless means to describe and transmit SOAP types.

XML Data Management

So far, we have discussed the basics of XML and some of its related technologies. The discussion brings to the fore the fundamental advantages of XML, hence providing an insight into the reasons behind its growing popularity and adoption. As more and more organizations and systems employ XML within their information management and exchange strategies, classical data management issues pertaining to its efficient and effective storage, retrieval, querying, indexing and manipulation arise. At the same time, previously uncharted information modeling challenges appear.

Database vendors have reacted to these new data and information management needs. Most commercial relational, object-relational and object-oriented database systems offer extensions, plug-ins and other mechanisms to support management of XML data.
In addition to this XML support within existing database management systems, native XML databases have emerged. These are designed for seamless storage, retrieval and manipulation of XML data and integration with related technologies.

With the large number of approaches and solutions available in the market, organizations and system developers with XML data management needs face a variety of challenges:

  • What are the various XML data management solutions available?
  • What are the features, services and tools offered by these different XML data management systems?
  • How can an in-house, custom solution be developed instead of using a commercially available system?
  • Which XML data management system or approach is the best in terms of performance and efficiency for a particular application?
  • Are there any good practice and domain or application-specific guidelines for information modeling with XML?
  • Are there other examples and applications of XML data management within a particular domain?

This book is intended as a support mechanism to address the above challenges. It provides a discussion of the various XML data management approaches employed in a range of products and applications. It also offers some performance and benchmarking results and guidelines relating to information modeling with XML.

How this Book Is Organized

This book is divided into five parts, each containing a coherent and closely related set of chapters. The parts are self-contained and can be read in any order. The five parts are as follows:

  • Introduction
  • Native XML Databases
  • XML and Relational Databases
  • Applications of XML
  • Performance and Benchmarks

Each part is summarized below.

Part 1: Introduction

This part includes a chapter by Brandin, which focuses on guidelines for achieving good grammar and style when modeling information using XML. The author argues that good grammar alleviates the need for redundant domain knowledge required for interpretation of XML by application programs.
Good style, on the other hand, ensures improved application performance, especially when it comes to storing, retrieving and managing information. The discussion offers insight into information modeling patterns inherent to XML and common XML information modeling pitfalls.

Part 2: Native XML Databases

Two native XML database systems, Tamino and eXist, are covered in this part. In Chapter 2 Schoening provides an overview of Tamino's architecture and APIs before moving on to discussing its XML storage and indexing features. Querying, tool support and access to data in other types of repositories are also described. The chapter offers a comprehensive discussion of these features, which are of key importance during the development of an XML data management application.

In a similar fashion, Chapter 3 by Meier introduces the various features and APIs of the open source system eXist. However, in contrast with Chapter 2, the main focus is on how query processing works within the system. As a result, the author provides a deeper insight into its indexing and storage architectures. Together the two chapters offer a balanced discussion, both of high-level application programming features of the two systems and of underlying indexing and storage mechanisms pertaining to efficient query processing.

Finally, in Chapter 4, we have included an example of an embedded XML database system. This is based upon the general-purpose embedded database engine Berkeley DB. Berkeley DB XML is able to store XML documents natively and provides indexing and an XPath query interface. Some of the capabilities of the product are demonstrated through code examples.

Part 3: XML and Relational Databases

This part provides an interesting mix of products and approaches to XML data management in relational and object-relational database systems.
Chapters 5, 6 and 7 discuss three commercial products: IBM DB2, Oracle 9i and MS SQL Server 2000 respectively, while Chapters 8 and 9 describe more general, roll-your-own strategies for relational and object-relational systems.

Chapter 5 by Benham highlights the technology and architecture of XML data management and information integration products from IBM. The focus is on the DB2 Universal Database and Xperanto. The former is the family of products providing relational and object-relational data management support for XML applications through the DB2 XML Extender, extended SQL and support for web services. The latter is the planned set of products and functions to address information integration requirements. These are aimed at complementing DB2 capabilities with additional support for XML and both structured and unstructured applications.

In Chapter 6, Hohenstein discusses similar features of Oracle 9i: the use of Oracle's CLOB functionality and OracleText Cartridge for handling data-centric XML documents and the XMLType, a new object type based on the object-relational functionality in Oracle 9i, for managing document-centric ones. He presents the Oracle SQL extensions for XML and provides examples of how to use these in order to build XML documents from relational data. Special features and tools for XML such as URI (Uniform Resource Identifier) support, parsers, class generator and Java Beans encapsulating these are also described.

In Chapter 7, Rys covers a feature set, similar to the ones in Chapters 5 and 6, for MS SQL Server 2000. He focuses on scenarios involving exporting and importing structured XML data. As a result, the focus is on the different building blocks such as HTTP and SOAP access, queryable and updateable XML views, rowset views over XML and XML serialization of relational results. Rowset views and XML serialization are aimed at providing XML support for users more familiar with the relational world.
XML views, on the other hand, offer XML-based access to the database for users more comfortable with XML.

Collectively, Chapters 5, 6 and 7 furnish an interesting comparison of the functionality offered by the three commercial systems and the various similarities and differences in their XML data management approaches. In contrast, Chapters 8 and 9 by Edwards and Brown respectively focus on generic, vendor independent solutions.

Edwards describes a generic architecture for storing XML documents in a relational database. The approach is aimed at avoiding vendor-specific database extensions and providing the database application programmer an opportunity to experiment with XML data storage without recourse to implementing much new technology. The database model is based on merging DOM with the Nested Sets Model hence offering the ability to store any well-formed XML document and ease of navigation. This results in fast serialization and querying but at the expense of update performance.

While Edwards' architecture is aimed at supporting the traditional relational database programmer, Brown's approach seeks to exploit the advanced features offered by the object-relational model and respective extensions of most relational database systems. He discusses object-relational schema design based on introducing, into the DBMS core, types and operators equivalent to the ones standardized in XML. The key functionality required of the DBMS core is an extensible indexing system allowing the comparison operator for built-in SQL types to be overloaded.
The new SQL 3 types thus defined act as a basis during the mapping of XPath expressions to SQL 3 queries over the schema.

Part 4: Applications of XML

This part presents several applications and case studies in XML data management ranging from bioinformatics, geographical and engineering data management to customer services and cash flow improvement through to large scale distributed systems, data warehouses and inductive database systems.

In Chapter 10, Direen and Jones discuss various challenges in bioinformatics data management and the role of XML as a means to capture and express complex biological information. They argue that the flexible and extensible information model employed by XML is well suited for the purpose and that database technology must exhibit the same characteristics if it is to keep in step with biological data management requirements. They discuss the role of the NeoCore XML management system in this context and the integration of a BLAST (Basic Local Alignment Search Tool) sequence search engine to enhance its ability to capture, manipulate, analyze and grow the information pertaining to complex systems that make up living organisms.

Kowalski presents two case studies involving XML and IBM's DB2 Universal Database in Chapter 11. Her first case study is that of a customer services unit which needs to react to the problems from the most important customers first. The second case study focuses on improving cash flow in a school by reducing the time for reimbursement from the Department of Education. For each case study the author presents the scenario and the particular problem to be solved. This is followed by an analysis identifying the existing conditions that prevent the problem from being solved.
A description of how XML and DB2 have been used to devise an appropriate solution concludes each case study.

Chapter 12, by Eglin, Hendra and Pentakalos, describes the design and implementation of the JEDMICS Open Access Interface, an EJB-based API that provides access to image data stored on a variety of storage media and meta-data stored in a relational database. The JEDMICS system uses XML as a portable data exchange solution and the authors discuss issues relating to its integration with the object-oriented core of the system and the relational database providing the persistent storage. A very interesting feature of the chapter is the authors' reflection on their experiences with a range of XML technologies such as DOM, JDOM, JAXB, XSLT and Oracle XSU in the context of JEDMICS.

In Chapter 13, Wilson and her co-authors offer an insight into the use of XML to enhance the GIDB (Geospatial Information Database) system to exchange geographical data over the Internet. They describe the integration of meteorological and oceanographic data, received remotely via the METCAST system, into GIDB. XML plays a key role here as it is utilized to express the data model catalog for METCAST. The authors also describe their implementation of the OpenGIS Web Map Server (WMS) specification to facilitate displaying georeferenced map layers from multiple WMS-compliant servers. Another interesting feature of this chapter is the implementation of the ability to read and write vector data using the OpenGIS Geography Markup Language (GML), an XML-based language standard for data interchange in Geographic Information Systems (GISs).

Rine sketches his vision of an Interstellar Space Wide Web in Chapter 14. He contrasts the issues relating to the development and deployment of such a facility with the problems encountered in today's World Wide Web.
Mainly, he focuses on adapters as configuration mechanisms for large scale, next generation distributed systems and as a means to increase the reusability of software components and architectures in this context. His approach to solving the problem is a configuration model and network-aware run-time environment called Space Wide Web Adapter Configuration eXtensible Markup Language (SWWACXML). The language associated with the environment captures component interaction properties and network-level QoS constraints. Adapters are automatically generated from the SWWACXML specifications. This facilitates reuse, as components are not tied to interactions or environments. Rine also discusses the role of the SWWACXML run-time system from this perspective as it supports automatic configuration and dynamic reconfiguration.

In Chapter 15, Meo and Psaila present an XML-based data model used to bridge the gap between various analysis models and the constraints they place on data representation, retrieval and manipulation in inductive databases. The model, called XDM (XML for Data Mining), allows simultaneous representation of source raw data and patterns. It also represents the pattern definition resulting from the pattern derivation process, hence supporting pattern reuse by the inductive database system. One of the significant advantages of XML in this context is the ability to describe complex heterogeneous topologies such as trees and association rules. In addition, the inherent flexibility of XML makes it possible to extend the inductive database framework with new pattern models and data mining operators, resulting in an open system customizable to the needs of the analyst.

Chapter 16, the last chapter in this part, describes Baril and Bellahsene's experiences in designing and managing an XML data warehouse. They propose the use of a view model and a graphical tool for the warehouse specification. Views defined in the warehouse allow filtering and restructuring of XML sources.
The warehouse is defined as a set of materialized views and provides a mediated schema that constitutes a uniform query interface. They also discuss mapping techniques to store XML data using a relational database system without redundancies and with optimized storage space. Lastly, the DAWAX system implementing these concepts is presented.

Part 5: Performance and Benchmarks

XML database management systems face the same stringent efficiency and performance requirements as any other database technology. Therefore, the final part of this book is devoted to a discussion of benchmarks and performance analyses of such systems.

Chapter 17 is driven by the need to design and adopt benchmarks to allow comparative performance analyses of the fast growing number of XML database management systems. Here Bressan and his colleagues describe three existing benchmarks for this purpose, namely XOO7, XMach-1 and XMark. They present the database and queries for each of the three benchmarks and compare them against four quality attributes: simplicity, relevance, portability and scalability. The discussion is aimed at identifying challenges facing the definition of a complete benchmark for XML database management systems.

In Chapter 18, Patel and Jagadish describe a benchmark that is aimed at measuring lower-level operations than those described in Chapter 17. The inspiration for their work is the Wisconsin Benchmark that was used to measure the performance of relational database systems in the early 1980s.

Schmauch and Fellhauer describe a detailed performance analysis in Chapter 19. They compare the time and space consumed by a range of XML data management approaches: relational databases, object-oriented databases, directory servers and native XML databases. XML documents are converted to DOM trees, hence reducing the problem to storing and extracting trees.
Instead of using a particular benchmark they derive their test suite from general requirements that the storage of XML documents has to meet. Different sized XML documents are stored using the four types of systems, selected fragments and complete documents are extracted, and the performance and disk space used are measured. As in the next chapter, Chapter 20, the authors offer a thorough set of empirical results. They also provide a detailed insight into existing XML data management approaches using the four systems analyzed. Finally, the experiences presented in the chapter are used as a basis to derive guidelines for benchmarking XML data management systems.

In Chapter 20, Fong, Wong and Fong present a comparative performance analysis of a native XML database and a relational database extended with XML data management features. They do not use any existing benchmarks but instead devise their own methodology and database. The key contribution of this chapter is a detailed set of empirical results presented as bar graphs.

Who Should Read This Book

This book is primarily aimed at professionals who are experienced in database technology and possibly XML and wish to learn how these two technologies can be used together. We hope to achieve this through discussions about alternative architectural approaches, case studies and performance benchmarks. Since the book is divided into a number of self-contained sections, it can also be used as a reference, with readers consulting only the sections relevant to them. The book may also be useful to students taking advanced database courses.