OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements

Authors:
Hagit Shatkay;Ramya Narayanaswamy;Santosh S. Nagaral;Na Harrington;Rohith Mv;Gowri Somanath;Ryan Tarpine;Kyle Schutter;Tim Johnstone;Dorothea Blostein;Sorin Istrail;Chandra Kambhamettu
Affiliations:
University of Delaware, Newark, DE and Queen's University, Kingston, Ontario, Canada;University of Delaware, Newark, DE;University of Delaware, Newark, DE;Queen's University, Kingston, Ontario, Canada;University of Delaware, Newark, DE;University of Delaware, Newark, DE;Brown University, Providence, RI;Brown University, Providence, RI;Brown University, Providence, RI;Queen's University, Kingston, Ontario, Canada;Brown University, Providence, RI;University of Delaware, Newark, DE
Venue:
Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
Year:
2012

References:
[1] An algorithm for suffix stripping. Readings in information retrieval.
[2] Digital Image Processing.
[3] Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1). ACM SIGKDD Explorations Newsletter.
[4] Searching Online Journals for Fluorescence Microscope Images Depicting Protein Subcellular Location Patterns. BIBE '01 Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering.
[5] Integrating image data into biomedical text categorization. Bioinformatics.
[6] Exploring a new space of features for document classification: figure clustering. CASCON '06 Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research.
[7] Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).
[8] Improved recognition of figures containing fluorescence microscope images in online journal articles using graphical models. Bioinformatics.
[9] Exploring text and image features to classify images in bioscience literature. BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis.
[10] Figure mining for biomedical research. Bioinformatics.
[11] Toward computer-assisted text curation: classification is easy (choosing training data can be hard...). ISMB/ECCB'09 Proceedings of the 2009 workshop of the BioLink Special Interest Group, international conference on Linking Literature, Information, and Knowledge for Biology.

Abstract

Images are a significant and useful source of information in published biomedical articles, yet they remain under-utilized in biomedical document classification and retrieval. Much current work on biomedical image retrieval and classification employs simple, standard image features, such as grayscale histograms and edge direction, to represent and classify images. We used such features ourselves in our earlier work [5], where image-class tags served to represent and classify articles. In the work presented here we focus on a different literature classification task, motivated by the need to identify articles discussing cis-regulatory elements and modules in the context of understanding complex gene networks. Curators who try to identify such articles in the vast literature use as a major cue a certain type of image, in which the conserved cis-regulatory region on the DNA is shown. Our experiments show that automatically identifying such images using common image features (like those mentioned above) can be highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features yields a novel representation of images that identifies DNA content in images with high precision and recall (over 0.9). Using the occurrence of such DNA-rich images within articles, we train a classifier that identifies articles pertaining to cis-regulatory elements with similarly high precision and recall. OCR-based image features have much potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA, and proteins. Moreover, the ability to identify such images automatically could be widely applicable to other important biomedical document classification tasks.
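
The idea of turning OCR output into a character-distribution feature can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify which distribution parameters are used, so the function names, the choice of the A/C/G/T fraction as a feature, and the `threshold` and `min_letters` values below are all hypothetical. The sketch also assumes the OCR step has already been run and its output is available as a plain string.

```python
from collections import Counter
import string

DNA_LETTERS = "ACGT"  # assumption: DNA-rich images are dominated by these letters


def char_distribution(ocr_text):
    """Relative frequency of each letter A-Z among the alphabetic characters."""
    letters = [c for c in ocr_text.upper() if c in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    return {c: (counts[c] / total if total else 0.0)
            for c in string.ascii_uppercase}


def dna_fraction(ocr_text):
    """Share of alphabetic characters that are A, C, G, or T."""
    dist = char_distribution(ocr_text)
    return sum(dist[c] for c in DNA_LETTERS)


def looks_dna_rich(ocr_text, threshold=0.9, min_letters=20):
    """Hypothetical rule of thumb; both cut-offs are illustrative only."""
    n_letters = sum(1 for c in ocr_text.upper() if c in string.ascii_uppercase)
    return n_letters >= min_letters and dna_fraction(ocr_text) >= threshold


if __name__ == "__main__":
    sequence_panel = "ACGTTGCAAGCT ACGGTCATTGCA\nTTGACCGTAACG"
    caption_text = "Figure 1: expression levels in embryos"
    print(looks_dna_rich(sequence_panel))  # sequence-like OCR text
    print(looks_dna_rich(caption_text))    # ordinary English text
```

In a fuller pipeline, the 26-value distribution (or summary statistics derived from it) would presumably be passed to a trained classifier rather than compared against a fixed threshold; the fixed rule here only keeps the example self-contained.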