A preliminary look into the use of named entity information for bioscience text tokenization

Authors:
Robert Arens
Affiliations:
University of Iowa, Iowa City, Iowa
Venue:
HLT-SRWS '04 Proceedings of the Student Research Workshop at HLT-NAACL 2004
Year:
2004

Citing 8
Cited 0

Data mining: practical machine learning tools and techniques with Java implementations

Data mining: practical machine learning tools and techniques with Java implementations
The Hierarchical Hidden Markov Model: Analysis and Applications

Machine Learning
Rutabaga by any other name: extracting biological names

Journal of Biomedical Informatics - Special issue: Sublanguage
A practical part-of-speech tagger

ANLC '92 Proceedings of the third conference on Applied natural language processing
Nymble: a high-performance learning name-finder

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Tuning support vector machines for biomedical named entity recognition

BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Tagging gene and protein names in full text articles

BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Protein name tagging for biomedical annotation in text

BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13

Quantified Score

Hi-index	0.00

Visualization

Abstract

Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty. This paper will introduce the tasks of tokenization, normalization before introducing BAccHANT, a system built for bioscience text normalization. Casting tokenization / normalization as a problem of punctuation classification motivates using machine learning methods in the implementation of this system. The evaluation of BAccHANT's performance included error analysis of the system's performance inside and outside of named entities (NEs) from the GENIA corpus, which led to the creation of a normalization system trained solely on data from inside NEs, BAccHANT-N. Evaluation of this new system indicated that normalization systems trained on data inside NEs perform better than systems trained both inside and outside NEs, motivating a merging of tokenization and named entity tagging processes as opposed to the standard pipelining approach.