OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements

  • Authors:
  • Hagit Shatkay;Ramya Narayanaswamy;Santosh S. Nagaral;Na Harrington;Rohith Mv;Gowri Somanath;Ryan Tarpine;Kyle Schutter;Tim Johnstone;Dorothea Blostein;Sorin Istrail;Chandra Kambhamettu

  • Affiliations:
  • University of Delaware, Newark, DE and Queen's University, Kingston, Ontario, CA;University of Delaware, Newark, DE;University of Delaware, Newark, DE;Queen's University, Kingston, Ontario, CA;University of Delaware, Newark, DE;University of Delaware, Newark, DE;Brown University, Providence, RI;Brown University, Providence, RI;Brown University, Providence, RI;Queen's University, Kingston, Ontario, CA;Brown University, Providence, RI;University of Delaware, Newark, DE

  • Venue:
  • Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
  • Year:
  • 2012

Abstract

Images are a significant and useful source of information in published biomedical articles, yet they remain under-utilized in biomedical document classification and retrieval. Much current work on biomedical image retrieval and classification employs simple, standard image features, such as gray-scale histograms and edge direction, to represent and classify images. We used such features ourselves in our early work [5], where image-class tags served to represent and classify articles. In the work presented here we focus on a different literature classification task, motivated by the need to identify articles discussing cis-regulatory elements and modules in the context of understanding complex gene networks. Curators who try to identify such articles in the vast literature use as a major cue a certain type of image, one that shows the conserved cis-regulatory region on the DNA. Our experiments show that automatically identifying such images using common image features (like those mentioned above) can be highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features yields a novel representation of images that identifies DNA content in images with high precision and recall (over 0.9). Utilizing the occurrence of such DNA-rich images within articles, we train a classifier that identifies articles pertaining to cis-regulatory elements with similarly high precision and recall. OCR-based image features have much potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA and proteins. Moreover, the ability to identify such images automatically has the potential to be widely applicable in other important biomedical document classification tasks.
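
As a rough illustration of the character-distribution idea described in the abstract, the following Python sketch computes letter frequencies from text that an OCR engine has already extracted from an image. It is not the authors' implementation: the abstract does not specify which distribution parameters are used, so the function names, the DNA-letter share, and the threshold rule are all hypothetical stand-ins. In particular, the paper trains a classifier on such features rather than applying a fixed cutoff.

```python
from collections import Counter
import string

DNA_LETTERS = "ACGT"  # nucleotide alphabet; the feature choice below is an assumption


def char_distribution_features(ocr_text):
    """Return (per-letter relative frequencies, share of letters in ACGT).

    `ocr_text` is assumed to be the raw string an OCR engine produced
    for one image; the OCR step itself is outside this sketch.
    """
    letters = [c for c in ocr_text.upper() if c in string.ascii_uppercase]
    total = len(letters)
    counts = Counter(letters)
    freqs = {
        c: (counts[c] / total if total else 0.0)
        for c in string.ascii_uppercase
    }
    dna_fraction = sum(freqs[c] for c in DNA_LETTERS)
    return freqs, dna_fraction


def looks_dna_rich(ocr_text, threshold=0.9, min_letters=20):
    """Hypothetical rule of thumb standing in for a trained classifier.

    Flags an image as DNA-rich when it contains enough letters and
    nearly all of them are A, C, G, or T. Both cutoffs are illustrative.
    """
    n_letters = sum(1 for c in ocr_text.upper() if c in string.ascii_uppercase)
    _, dna_fraction = char_distribution_features(ocr_text)
    return n_letters >= min_letters and dna_fraction >= threshold
```

In practice the 26 frequencies (or summary statistics derived from them) would be passed as a feature vector to a learned model; the threshold function is included only to make the intuition concrete, namely that an image of a DNA sequence yields OCR output dominated by four letters, whereas an ordinary figure caption does not.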