Analysis of Document Structures for Element Type Classification

  • Authors:
  • Helena Ahonen;Barbara Heikkinen;Oskari Heinonen;Jani Jaakkola;Mika Klemettinen

  • Venue:
  • PODDP '98 Proceedings of the 4th International Workshop on Principles of Digital Document Processing
  • Year:
  • 1998

Abstract

As more and more digital documents become available for public use from different sources, the needs of users grow as well. One such need is the seamless integration of heterogeneous collections, e.g., the ability to query and format documents in a uniform way. Processing of documents is greatly enhanced if their structure is explicitly represented in some standard markup (e.g., SGML, XML, or HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration. We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions (DTDs), and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a pre-defined generic DTD. Specific uses of the classification, in addition to formatting and querying, include the assembly of new documents from existing document fragments and the automatic generation of style sheet templates for the original document type definitions. We have implemented the classification method and experimented with several document types.
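
To make the idea of attaching a generic class to each element type more concrete, here is a minimal sketch. It is not the authors' method: the abstract does not name the generic classes or the classification rules, so the class names (`CONTAINER`, `MIXED`, `LEAF`), the helper functions, and the heuristic are all hypothetical, and XML parsed with Python's standard library stands in for SGML. The sketch only looks at document instances (not DTDs), assigns one class per element type, and then builds a "virtual" document in which every element carries its generic class while the tree hierarchy is kept.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_stats(root):
    """Count, per element type, how often instances contain child elements or text."""
    stats = defaultdict(lambda: {"with_children": 0, "with_text": 0})
    for elem in root.iter():
        s = stats[elem.tag]
        if len(elem) > 0:
            s["with_children"] += 1
        # Text directly inside an element is its own .text plus the .tail of each child.
        own_text = (elem.text or "").strip()
        tail_text = any((child.tail or "").strip() for child in elem)
        if own_text or tail_text:
            s["with_text"] += 1
    return stats


def classify(stats):
    """Assign one hypothetical generic class to each element type (assumed heuristic)."""
    classes = {}
    for tag, s in stats.items():
        if s["with_children"] and s["with_text"]:
            classes[tag] = "MIXED"       # text interleaved with sub-elements
        elif s["with_children"]:
            classes[tag] = "CONTAINER"   # purely structural element
        else:
            classes[tag] = "LEAF"        # text-only or empty element
    return classes


def to_generic(elem, classes):
    """Build a virtual copy of the tree: generic class as tag, original name kept as attribute."""
    new = ET.Element(classes[elem.tag], {"orig": elem.tag})
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(to_generic(child, classes))
    return new


# Toy instance; the element names are invented for illustration.
doc = ET.fromstring(
    "<article><title>On DTDs</title>"
    "<sec><head>Intro</head><para>Some <emph>mixed</emph> text.</para></sec>"
    "</article>"
)
classes = classify(collect_stats(doc))
generic = to_generic(doc, classes)
print(classes)
print(ET.tostring(generic, encoding="unicode"))
```

Because the original element name is preserved in the `orig` attribute, a generic style sheet or query could operate on the class names while still being able to recover the source element type, which is loosely analogous to processing documents against a generic DTD.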