Automatic acquisition of huge training data for bio-medical named entity recognition

  • Authors:
  • Yu Usami; Han-Cheol Cho; Naoaki Okazaki; Jun'ichi Tsujii

  • Affiliations:
  • The University of Tokyo, Tokyo, Japan; The University of Tokyo, Tokyo, Japan; Tohoku University, Sendai, Japan; Microsoft Research Asia, Beijing, China

  • Venue:
  • BioNLP '11 Proceedings of BioNLP 2011 Workshop
  • Year:
  • 2011

Abstract

Named Entity Recognition (NER) is an important first step for BioNLP tasks such as gene normalization and event extraction. To achieve high performance, recent NER systems employ supervised machine learning techniques, which require a manually annotated corpus in which every mention of the desired semantic types is annotated. However, a great amount of human effort is necessary to build and maintain such a corpus. This study explores a method for building a high-performance NER system without a manually annotated corpus, using instead a comprehensive lexical database that stores numerous expressions of semantic types together with a huge amount of unannotated text. We underscore the effectiveness of our approach by comparing the performance of NER systems trained on automatically acquired training data with that of systems trained on a manually annotated corpus.
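
The general idea of deriving training data from a lexical database can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not describe in detail; it only shows one common way such data might be produced, namely greedy longest-match dictionary lookup that emits BIO tags. The lexicon contents, the tokenizer, and the names `LEXICON`, `tokenize`, and `auto_annotate` are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical toy lexicon (an assumption, not from the paper):
# surface expression -> semantic type.
LEXICON = {
    "p53": "GENE",
    "tumor necrosis factor": "GENE",
}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def auto_annotate(text, lexicon=LEXICON):
    """Label tokens with BIO tags by case-insensitive dictionary matching.

    Longer lexicon entries are tried first, so multi-word expressions
    win over shorter entries that start at the same token.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]

    # Pre-tokenize lexicon entries; skip any entry that yields no tokens.
    entries = [(tokenize(k.lower()), v) for k, v in lexicon.items()]
    entries = [e for e in entries if e[0]]
    entries.sort(key=lambda e: len(e[0]), reverse=True)

    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for entry_tokens, sem_type in entries:
            n = len(entry_tokens)
            if lowered[i:i + n] == entry_tokens:
                tags[i] = "B-" + sem_type
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + sem_type
                i += n
                break
        else:
            # No lexicon entry starts at this token.
            i += 1
    return list(zip(tokens, tags))


if __name__ == "__main__":
    for token, tag in auto_annotate("Tumor necrosis factor activates p53."):
        print(token, tag)
```

Sentences labeled this way could then be fed to any supervised sequence tagger in place of hand-annotated data. Plain dictionary matching of this kind is noisy, since ambiguous names and mentions missing from the lexicon produce wrong labels, so a practical system would need additional filtering.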