A preliminary look into the use of named entity information for bioscience text tokenization

  • Authors: Robert Arens
  • Affiliations: University of Iowa, Iowa City, Iowa
  • Venue: HLT-SRWS '04: Proceedings of the Student Research Workshop at HLT-NAACL 2004
  • Year: 2004

Abstract

Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty. This paper introduces the tasks of tokenization and normalization, then presents BAccHANT, a system built for bioscience text normalization. Casting tokenization and normalization as a problem of punctuation classification motivates the use of machine learning methods in implementing the system. The evaluation of BAccHANT included an error analysis of its performance inside and outside named entities (NEs) from the GENIA corpus, which led to the creation of BAccHANT-N, a normalization system trained solely on data from inside NEs. Evaluation of this new system indicated that normalization systems trained on data inside NEs perform better than systems trained on data from both inside and outside NEs. This result motivates merging the tokenization and named entity tagging processes, as opposed to the standard pipelining approach.
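
To make the idea of casting tokenization and normalization as punctuation classification concrete, the following is a minimal Python sketch. It is not BAccHANT's implementation: the feature set (coarse classes of the neighboring characters), the label names "keep", "remove", and "break", and the helper names char_class, punctuation_instances, and apply_labels are all hypothetical, chosen for illustration only. The labels are supplied by hand here; in a learned system a classifier would predict them from the features.

```python
import string

PUNCT = set(string.punctuation)


def char_class(ch):
    """Coarse character class used as a feature value (hypothetical feature)."""
    if ch == "":
        return "edge"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    if ch.isspace():
        return "space"
    return "punct"


def punctuation_instances(text):
    """Return one feature dict per punctuation character in text."""
    instances = []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            instances.append({
                "index": i,
                "char": ch,
                "prev": char_class(prev_ch),
                "next": char_class(next_ch),
            })
    return instances


def apply_labels(text, labels):
    """Rewrite text from per-index labels and split it into tokens.

    labels maps a character index to "keep", "remove", or "break"
    (replace with a space). Unlabeled characters are kept. The label
    set is an assumption made for this sketch.
    """
    out = []
    for i, ch in enumerate(text):
        label = labels.get(i, "keep")
        if label == "remove":
            continue
        out.append(" " if label == "break" else ch)
    return "".join(out).split()


text = "NF-kappa B (p50)"
instances = punctuation_instances(text)
# Hand-assigned labels standing in for classifier output.
tokens = apply_labels(text, {2: "remove", 11: "break", 15: "break"})
```

In this framing each punctuation mark is one classification instance, so any standard classifier can be trained on the feature dicts; restricting the training instances to those inside named entities would correspond to the BAccHANT-N setup described in the abstract.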