The difficulties of taxonomic name extraction and a solution

  • Authors:
  • Guido Sautter;Klemens Böhm

  • Affiliations:
  • Universität Karlsruhe (TH), Germany;Universität Karlsruhe (TH), Germany

  • Venue:
  • BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis
  • Year:
  • 2006

Abstract

In modern biology, digitizing biosystematics publications is an important task. Extracting taxonomic names from such documents is one of its major challenges, because these names identify the various genera and species. This article reports on our experiences with learning techniques for this particular task. We explain why established Named-Entity Recognition techniques are somewhat difficult to apply in our context; one reason is that very little training data is available. Our experiments show that a combined approach relying on regular expressions, heuristics, and word-level language recognition achieves very high precision and recall and makes it possible to cope with these difficulties.
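
To make the idea of combining regular expressions with a word-level filter concrete, here is a minimal sketch in Python. It is not the authors' implementation: the pattern `BINOMIAL`, the word list `COMMON_ENGLISH`, and the function `extract_candidates` are hypothetical names chosen for illustration, and the small stop-word set is only a crude stand-in for the word-level language recognition mentioned in the abstract.

```python
import re

# Hypothetical pattern for candidate binomials: a capitalized genus
# (or an abbreviated one such as "F.") followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")

# Assumption: a tiny English word list stands in for real word-level
# language recognition, which would decide per word whether it looks
# like ordinary English or like a Latin name part.
COMMON_ENGLISH = {
    "the", "and", "was", "were", "found", "with", "from",
    "this", "that", "has", "are", "species",
}


def extract_candidates(text):
    """Return candidate taxonomic names found in text, in order."""
    names = []
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.groups()
        # Heuristic filter: discard pairs where either word is common English.
        if genus.lower() in COMMON_ENGLISH or epithet in COMMON_ENGLISH:
            continue
        names.append(f"{genus} {epithet}")
    return names


if __name__ == "__main__":
    sample = "The ant Formica rufa was found near nests of Lasius niger in Germany."
    print(extract_candidates(sample))
```

The regular expression deliberately over-generates (any capitalized word followed by a lowercase word is a candidate), and the filter then removes the obvious false positives. This mirrors, in a very simplified form, the division of labour between pattern matching and language-based filtering that the abstract describes.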