GeneTUC, GENIA and Google: natural language understanding in molecular biology literature

  • Authors:
  • Rune Sætre; Harald Søvik; Tore Amble; Yoshimasa Tsuruoka

  • Affiliations:
  • Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway; Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway; Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway; Department of Computer Science, University of Tokyo, Bunkyo-ku, Tokyo, Japan

  • Venue:
  • Transactions on Computational Systems Biology V
  • Year:
  • 2006

Abstract

With the increasing amount of biomedical literature, there is a need for automatic extraction of information to support biomedical researchers. GeneTUC has been developed to read biological texts and answer questions about them afterwards. The knowledge base of the system is constructed by parsing MEDLINE abstracts or other online text strings retrieved by the Google API. When the system encounters words that are not in the dictionary, the Google API can be used to automatically determine the semantic class of the word and add it to the dictionary. The performance of the GeneTUC parser was tested against the manually tagged GENIA corpus with EvalB, giving bracketing precision and recall scores of 70.6% and 53.9%, respectively. GeneTUC was able to parse 60.2% of the sentences, and the POS-tagging accuracy was 86.0%. These figures are lower than those of the best taggers and parsers available, but GeneTUC is also capable of deep reasoning, such as anaphora resolution and question answering, which state-of-the-art parsers do not provide.
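
The bracketing precision and recall mentioned in the abstract can be illustrated with a minimal sketch. It assumes each constituent is represented as a (label, start, end) tuple over token positions and counts a predicted bracket as correct only if an identical bracket appears in the gold tree. The function name `bracket_scores` and the toy brackets are hypothetical and are not taken from GeneTUC or the GENIA corpus; EvalB itself applies further conventions, configured through its parameter file, that this sketch omits.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Return (precision, recall) for two lists of (label, start, end) brackets.

    Hypothetical helper for illustration only: a predicted bracket is
    counted as matched when the same labeled span occurs in the gold list.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection, so duplicate brackets are not matched twice.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Invented toy brackets for a single five-token sentence.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall = bracket_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the five predicted brackets match the gold tree, so precision is 3/5, and three of the four gold brackets are recovered, so recall is 3/4.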