Database support for species extraction from the biosystematics literature: a feasibility demonstration

  • Authors: Ralf Duckstein; Klemens Böhm

  • Affiliation: Otto-von-Guericke-University Magdeburg, Magdeburg, Germany (both authors)

  • Venue: Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year: 2004

Abstract

Part of the biosystematics literature is currently being digitized and manually marked up with XML, with the goal of making fast search over these documents feasible. Because manual markup is costly, biologists would like to know the value of this effort in advance. Deploying standard XML database technology in a straightforward way is not feasible because of two characteristics of biosystematics documents. First, descriptions of taxa are related: a more specific taxon should inherit from a more general one. Combining inheritance with information-retrieval mechanisms gives rise to difficulties that this article addresses. Second, such documents frequently contain very specific technical terms, e.g., geographical or biological terms. To investigate the characteristics of search in the presence of these difficulties, we have designed and implemented a system based on relational database technology. We use a collection of XML documents that mimics the characteristics of biosystematics documents, as we will explain. We propose two query-evaluation alternatives and compare them by means of performance experiments. It turns out that our techniques can administer the envisioned corpus of documents efficiently while coping with both problems.
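
To make the inheritance difficulty concrete, the following is a minimal sketch of keyword search over taxon descriptions in a relational store, where a taxon also matches on terms taken from its ancestors' descriptions. The abstract does not give the authors' schema or query plans, so everything here is an assumption made for illustration: the `taxon` and `description` tables, the helper names `build_demo_db` and `taxa_matching`, and the sample rows. The sketch resolves inheritance at query time with a recursive query; it is not claimed to correspond to either of the two query-evaluation alternatives the paper compares.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up taxon hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical schema: one row per taxon, one row per description term.
        CREATE TABLE taxon (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id INTEGER REFERENCES taxon(id)
        );
        CREATE TABLE description (
            taxon_id INTEGER REFERENCES taxon(id),
            term TEXT NOT NULL
        );
        -- Illustrative sample data only.
        INSERT INTO taxon VALUES
            (1, 'Formicidae', NULL),
            (2, 'Formica', 1),
            (3, 'Formica rufa', 2);
        INSERT INTO description VALUES
            (1, 'eusocial'),
            (2, 'mound-building'),
            (3, 'red');
        """
    )
    return conn


def taxa_matching(conn, terms):
    """Return names of taxa whose own or inherited terms cover all `terms`."""
    terms = sorted(set(terms))
    placeholders = ",".join("?" for _ in terms)
    query = f"""
        WITH RECURSIVE lineage(taxon_id, ancestor_id) AS (
            -- Every taxon is its own zero-step ancestor.
            SELECT id, id FROM taxon
            UNION ALL
            -- Walk one level up the hierarchy per recursion step.
            SELECT l.taxon_id, t.parent_id
            FROM lineage AS l
            JOIN taxon AS t ON t.id = l.ancestor_id
            WHERE t.parent_id IS NOT NULL
        )
        SELECT t.name
        FROM lineage AS l
        JOIN description AS d ON d.taxon_id = l.ancestor_id
        JOIN taxon AS t ON t.id = l.taxon_id
        WHERE d.term IN ({placeholders})
        GROUP BY t.id
        HAVING COUNT(DISTINCT d.term) = ?
        ORDER BY t.name
    """
    return [row[0] for row in conn.execute(query, (*terms, len(terms)))]


if __name__ == "__main__":
    conn = build_demo_db()
    # 'eusocial' is stated only for the most general taxon, 'red' only for the
    # most specific one; the species matches both through inheritance.
    print(taxa_matching(conn, ["eusocial", "red"]))
```

A query-time walk like this keeps the stored data small but repeats the hierarchy traversal for every search. Materializing inherited terms ahead of time would trade storage and update cost for simpler lookups.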