Toward computer-assisted text curation: classification is easy (choosing training data can be hard...)

Authors:
Robert Denroche;Ramana Madupu;Shibu Yooseph;Granger Sutton;Hagit Shatkay
Affiliations:
Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, Ontario, Canada;Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States;Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States;Informatics Department, J. Craig Venter Institute, Rockville, Maryland, United States;Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, Ontario, Canada
Venue:
ISMB/ECCB'09 Proceedings of the 2009 workshop of the BioLink Special Interest Group, international conference on Linking Literature, Information, and Knowledge for Biology
Year:
2009

Abstract

We aim to design a system that classifies scientific articles according to whether they report protein characterization experiments, in order to aid the curators populating JCVI's Characterized Protein (CHAR) Database of experimentally characterized proteins. We trained two classifiers on small datasets labeled by CHAR curators, and a third classifier on a much larger dataset built from annotations in public databases. Performance varied greatly, in ways we did not anticipate. We describe the datasets and the classification method, and discuss the unexpected results.