Toward computer-assisted text curation: classification is easy (choosing training data can be hard...)

  • Authors:
  • Robert Denroche; Ramana Madupu; Shibu Yooseph; Granger Sutton; Hagit Shatkay

  • Affiliations:
  • Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, Ontario, Canada; Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States; Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States; Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States; Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, Ontario, Canada

  • Venue:
  • ISMB/ECCB'09 Proceedings of the 2009 workshop of the BioLink Special Interest Group, international conference on Linking Literature, Information, and Knowledge for Biology
  • Year:
  • 2009

Abstract

We aim to design a system for classifying scientific articles based on the presence of protein characterization experiments, intending to aid the curators populating JCVI's Characterized Protein (CHAR) Database of experimentally characterized proteins. We trained two classifiers using small datasets labeled by CHAR curators, and another classifier based on a much larger dataset using annotations from public databases. Performance varied greatly, in ways we did not anticipate. We describe the datasets and the classification method, and discuss the unexpected results.