As more and more digital documents become available for public use from different sources, the needs of users grow as well. One of these needs is seamless integration of heterogeneous collections, e.g., the ability to query and format documents in a uniform way. Document processing is greatly enhanced if the structure of documents is explicitly represented using a standard markup language (SGML, XML, HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration. We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions (DTDs), and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a pre-defined generic DTD. In addition to formatting and querying, specific uses of the classification include assembling new documents from existing document fragments and automatically generating style sheet templates for the original document type definitions. We have implemented the classification method and experimented with several document types.