Extracting Document Structure to Facilitate a Knowledge Base Creation for The UML Superstructure Specification

Authors:
Mehrdad Nojoumian;Timothy C. Lethbridge
Affiliations:
University of Ottawa;University of Ottawa
Venue:
ITNG '07 Proceedings of the International Conference on Information Technology
Year:
2007

Abstract

The research presented in this paper aims to facilitate the creation of knowledge bases (KBs) for software specifications; the UML superstructure specification is our initial target. Our motivation is that such specifications are dense, repetitive, and difficult to use. They are written primarily in semi-structured text, but the structure must be maintained manually as they are edited, which results in inconsistency. End users cannot use them efficiently because of the duplication, the numerous concepts connected only implicitly, and the general complexity of the document. Our immediate objective is to generate a KB for the UML specification by extracting knowledge from as many sources in the document as possible, such as the document structure, embedded natural language, and implicit and explicit cross-references. In this paper we focus on the first step: extraction of the document's logical structure. Many key concepts of a document are expressed in this structure, which includes the headings of chapters, sections, subsections, and so on. Extracting this structure in XML format gives us a good infrastructure for the subsequent KB creation steps.
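
To make the idea of a logical structure in XML concrete, the following is a minimal sketch and not the authors' extraction method. It assumes the headings have already been pulled out of the document as plain text lines with dotted section numbers. The function name `headings_to_xml`, the element names `document` and `section`, and the sample headings are all hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed heading shape: a dotted section number, whitespace, then a title,
# for example "2.2.1 Abstraction". Real specifications may need other rules.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def headings_to_xml(lines):
    """Nest numbered heading lines into an XML tree by numbering depth."""
    root = ET.Element("document")
    # Stack of (depth, element); the root sits at depth 0 and is never popped.
    stack = [(0, root)]
    for line in lines:
        match = HEADING_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like numbered headings
        number, title = match.groups()
        depth = number.count(".") + 1
        # Close every open section that is at the same depth or deeper.
        while stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1]
        node = ET.SubElement(parent, "section", number=number, title=title)
        stack.append((depth, node))
    return root


if __name__ == "__main__":
    # Made-up headings for illustration; not quoted from the specification.
    sample = [
        "1 Scope",
        "2 Classes",
        "2.1 Overview",
        "2.2 Class Descriptions",
        "2.2.1 Abstraction",
        "3 Components",
    ]
    print(ET.tostring(headings_to_xml(sample), encoding="unicode"))
```

In this sketch, nesting depth is inferred only from the section number, so chapters, sections, and subsections all become `section` elements distinguished by their position in the tree. A later KB creation step could then query that tree instead of the original text.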