Use of Figures in Literature Mining for Biomedical Digital Libraries

  • Authors:
  • Nawei Chen; Hagit Shatkay; Dorothea Blostein

  • Affiliations:
  • Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada

  • Venue:
  • DIAL '06 Proceedings of the Second International Conference on Document Image Analysis for Libraries
  • Year:
  • 2006

Abstract

The maintenance of biomedical digital libraries (including organism databases and protein databases) involves analysis of a large number of documents. Much work is done manually: curators study large numbers of biomedical documents while updating and annotating organism databases such as MGI (Mouse Genome Informatics) and FlyBase (a database of the fruit-fly genome). We summarize the annotation process in organism databases, and describe some of the roles played by the Gene Ontology and by document databases such as PubMed. Efforts are ongoing to automate parts of the annotation process. Biomedical text mining contests, such as the TREC Genomics Track [6, 7], define annotation subtasks, and provide training and test data. So far, these efforts have focused on the analysis of the text content of documents. We are investigating the analysis of figures in biomedical documents; the information derived from figure analysis may later be combined with the information derived from text analysis. We present an algorithm for using figures in document triage; triage involves determining which documents are relevant to a given annotation task. In our triage algorithm, we segment figures into subfigures and classify the subfigures as Graphical, Gel, Fluorescence Microscopy, or Other Microscopy. A secondary classification into subcategories is performed by clustering, using clusters created from the subfigures in the labeled training data. The classifications of all subfigures in a document are combined to form a document descriptor. The document descriptor is then classified using a Naive Bayes classifier, as either relevant or irrelevant to the given annotation task.
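
The last two steps of the triage algorithm (combining subfigure classifications into a document descriptor, then classifying that descriptor with Naive Bayes) can be illustrated with a minimal Python sketch. The abstract does not say how the descriptor is encoded or which Naive Bayes variant is used, so the sketch assumes a vector of per-class subfigure counts and a multinomial model with Laplace smoothing. The names `document_descriptor` and `MultinomialNaiveBayes` and the toy training documents are hypothetical and serve only as illustration; figure segmentation, subfigure classification, and the clustering-based secondary classification are not shown.

```python
import math
from collections import Counter

# Top-level subfigure classes named in the abstract.
CLASSES = ["Graphical", "Gel", "Fluorescence Microscopy", "Other Microscopy"]


def document_descriptor(subfigure_labels):
    """Combine subfigure labels into a count vector (an assumed encoding)."""
    counts = Counter(subfigure_labels)
    return [counts[c] for c in CLASSES]


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (assumed variant)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant added to every class count

    def fit(self, descriptors, labels):
        n_features = len(descriptors[0])
        self.labels_ = sorted(set(labels))
        self.log_prior_ = {}
        self.log_likelihood_ = {}
        for y in self.labels_:
            rows = [d for d, lab in zip(descriptors, labels) if lab == y]
            self.log_prior_[y] = math.log(len(rows) / len(descriptors))
            # Smoothed total count of each subfigure class within label y.
            totals = [sum(r[j] for r in rows) + self.alpha
                      for j in range(n_features)]
            norm = sum(totals)
            self.log_likelihood_[y] = [math.log(t / norm) for t in totals]
        return self

    def predict(self, descriptor):
        def score(y):
            # Log posterior up to a constant: prior plus weighted likelihoods.
            return self.log_prior_[y] + sum(
                x * lp for x, lp in zip(descriptor, self.log_likelihood_[y]))
        return max(self.labels_, key=score)


# Invented toy data: subfigure labels per document and a relevance label.
train_docs = [
    (["Gel", "Gel", "Graphical"], "relevant"),
    (["Gel", "Fluorescence Microscopy"], "relevant"),
    (["Graphical", "Graphical"], "irrelevant"),
    (["Other Microscopy", "Graphical"], "irrelevant"),
]
X = [document_descriptor(labels) for labels, _ in train_docs]
y = [relevance for _, relevance in train_docs]
model = MultinomialNaiveBayes().fit(X, y)

print(model.predict(document_descriptor(["Gel", "Gel"])))
print(model.predict(document_descriptor(["Graphical", "Graphical"])))
```

Counts are used here because they fit a multinomial model directly. A descriptor that also records the cluster-based subcategories would simply extend `CLASSES` with one entry per subcategory.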