Automating semantic markup of semi-structured text via an induced knowledge base: a case study using floras

Authors:
Linda Smith;Hong Cui
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Automating semantic markup of semi-structured text via an induced knowledge base: a case study using floras
Year:
2005

Abstract

In this dissertation research, we proposed and evaluated an automatic, machine-learning-based approach to the semantic markup of semi-structured domain text with XML. As a case study, we used plant morphological descriptions as the target text. We tested the hypothesis that domain knowledge learned from semi-structured plant descriptions significantly improves markup performance on less structured plant descriptions. Three collections of plant descriptions were extracted from Flora of North America (FNA), Flora of China (FOC), and Flora of North Central Texas (FNCT). Using a set of structuredness measures we developed, the FNA and FOC collections were selected as the base corpora and the FNCT collection as the test corpus. A number of markup algorithms were evaluated on the three collections, and the best-performing algorithms were identified. One of these was used to mark up the base corpora. Two types of domain knowledge were then mined from the marked-up base corpora: knowledge on the semantic classes of n-grams and knowledge on the relative positions of elements. The usefulness of the two types of induced knowledge was examined by comparing the markup performance of the algorithm when given different access to the induced knowledge base. Major findings include: (1) Machine learning algorithms tailored to the special characteristics of the domain text perform best on all three collections of descriptions and at different granularities of markup. (2) The induced knowledge on the semantic classes of n-grams significantly (α = 0.05) improves markup performance on FNCT descriptions. (3) In general, the induced knowledge on the semantic classes of n-grams is a more reliable knowledge source than knowledge learned from training examples. (4) The induced knowledge on the relative positions of elements seems less useful, especially as the variation in element sequences within and across collections grows. (5) The two types of knowledge interact: when the knowledge on semantic classes is of higher quality, the use of knowledge on relative positions seems to be more beneficial.

This dissertation is a compound document (contains both a paper copy and a CD as part of the dissertation).
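The two kinds of induced knowledge named in the abstract can be illustrated with a minimal sketch. This is not the dissertation's algorithm. It assumes that a marked-up description is available as an ordered list of (element, text) pairs, and the sample data, the element names, and the functions `ngrams`, `induce_knowledge`, and `likely_class` are all hypothetical. It counts which element each n-gram occurs in (semantic classes of n-grams) and how often one element precedes another (relative positions of elements).

```python
from collections import Counter, defaultdict

# Hypothetical marked-up descriptions (illustrative only, not taken from FNA/FOC/FNCT).
# Each description is an ordered list of (element name, element text) pairs.
MARKED_UP = [
    [("leaf", "leaves alternate, blades ovate"),
     ("flower", "flowers white, petals 5")],
    [("leaf", "leaves opposite"),
     ("flower", "flowers yellow"),
     ("fruit", "fruits ovoid capsules")],
]


def ngrams(text, n_max=2):
    """Yield all word n-grams of length 1..n_max from lowercased, de-punctuated text."""
    tokens = [t.strip(",.;").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def induce_knowledge(descriptions, n_max=2):
    """Return (class_counts, order_counts) mined from marked-up descriptions.

    class_counts maps an n-gram to a Counter of the elements it occurred in.
    order_counts maps (a, b) to the number of descriptions-positions where
    element a appears somewhere before element b.
    """
    class_counts = defaultdict(Counter)
    order_counts = Counter()
    for desc in descriptions:
        for element, text in desc:
            for gram in ngrams(text, n_max):
                class_counts[gram][element] += 1
        names = [element for element, _ in desc]
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if a != b:
                    order_counts[(a, b)] += 1
    return class_counts, order_counts


def likely_class(class_counts, gram):
    """Return the element an n-gram was seen in most often, or None if unseen."""
    counts = class_counts.get(gram)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


class_counts, order_counts = induce_knowledge(MARKED_UP)
```

In a sketch like this, `likely_class(class_counts, "petals")` would suggest an element for an unlabeled clause in a less structured description, and `order_counts` could be consulted to check whether a proposed element sequence is plausible. How the dissertation actually weighs and combines the two knowledge sources is not stated in the abstract.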