Automating semantic markup of semi-structured text via an induced knowledge base: a case study using floras

  • Authors: Linda Smith; Hong Cui
  • Affiliations: University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign
  • Venue: Automating semantic markup of semi-structured text via an induced knowledge base: a case study using floras
  • Year: 2005

Abstract

In this dissertation research, we proposed and evaluated a machine-learning-based approach to the automatic semantic markup of semi-structured domain text with XML. As a case study, we used plant morphological descriptions as the target text. We tested the hypothesis that domain knowledge learned from semi-structured plant descriptions significantly improves markup performance on less structured plant descriptions. Three collections of plant descriptions were extracted from Flora of North America (FNA), Flora of China (FOC), and Flora of North Central Texas (FNCT). Using a set of structuredness measures we developed, we selected the FNA and FOC collections as the base corpora and the FNCT collection as the test corpus. A number of markup algorithms were evaluated on the three collections, and the best-performing algorithms were identified. One of these was used to mark up the base corpora. Two types of domain knowledge were then mined from the marked-up base corpora: knowledge of the semantic classes of n-grams and knowledge of the relative positions of elements. The usefulness of the two types of induced knowledge was examined by comparing the markup performance of the algorithm when it was given different levels of access to the induced knowledge base. Major findings include:

(1) Machine learning algorithms that are tailored to exploit special characteristics of the domain text perform best on all three collections of descriptions and at different granularities of markup.
(2) The induced knowledge of the semantic classes of n-grams significantly (α = 0.05) improves markup performance on FNCT descriptions.
(3) In general, the induced knowledge of the semantic classes of n-grams is a more reliable knowledge source than the knowledge learned from training examples.
(4) The induced knowledge of the relative positions of elements appears to be less useful, especially when the variation in element sequences within and across collections becomes larger.
(5) The two types of knowledge interact: when the knowledge of semantic classes is of higher quality, the use of knowledge of relative positions appears to be more beneficial.

Note: This dissertation is a compound document (it contains both a paper copy and a CD as part of the dissertation).
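The two kinds of induced knowledge named in the abstract can be illustrated with a small sketch. This is not the dissertation's algorithm: the sample clauses, the element labels, the function names (`ngrams`, `induce`, `guess_label`) and the simple vote-counting rule are all assumptions made for illustration. The sketch counts which element label each n-gram co-occurs with (semantic classes of n-grams) and which label tends to follow which (relative positions of elements), then uses both to label an unseen clause.

```python
from collections import Counter, defaultdict

# Hypothetical marked-up base corpus: each description is a list of
# (element label, clause text) pairs. Invented for illustration only.
MARKED_UP = [
    [("leaf", "leaves alternate, ovate"),
     ("flower", "flowers white, solitary"),
     ("fruit", "fruits ovoid capsules")],
    [("leaf", "leaves opposite, lanceolate"),
     ("flower", "flowers yellow"),
     ("fruit", "fruits red berries")],
]


def tokenize(text):
    """Lowercase the clause and split on whitespace, treating commas as spaces."""
    return text.lower().replace(",", " ").split()


def ngrams(tokens, n_max=2):
    """Yield all n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def induce(descriptions):
    """Mine two tables from marked-up descriptions.

    class_counts: n-gram -> Counter of element labels it appeared under.
    follows:      label  -> Counter of labels that immediately followed it.
    """
    class_counts = defaultdict(Counter)
    follows = defaultdict(Counter)
    for desc in descriptions:
        for label, text in desc:
            for gram in ngrams(tokenize(text)):
                class_counts[gram][label] += 1
        for (prev, _), (nxt, _) in zip(desc, desc[1:]):
            follows[prev][nxt] += 1
    return class_counts, follows


def guess_label(text, class_counts, prev_label=None, follows=None):
    """Label a clause by summing n-gram votes, plus optional position votes."""
    votes = Counter()
    for gram in ngrams(tokenize(text)):
        votes.update(class_counts.get(gram, {}))
    if prev_label is not None and follows is not None:
        votes.update(follows.get(prev_label, {}))
    return votes.most_common(1)[0][0] if votes else None


class_counts, follows = induce(MARKED_UP)
```

In this toy setup, `guess_label("flowers pink, clustered", class_counts)` relies on n-gram knowledge alone, while `guess_label("petals pink", class_counts, prev_label="leaf", follows=follows)` has no known n-grams and falls back on positional knowledge. Summing raw counts from both sources is a deliberate simplification; the abstract does not say how the evaluated algorithm combines them.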