An algorithm for suffix stripping
Readings in information retrieval
Digital Image Processing
Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1)
ACM SIGKDD Explorations Newsletter
BIBE '01 Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering
Integrating image data into biomedical text categorization
Bioinformatics
Exploring a new space of features for document classification: figure clustering
CASCON '06 Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Exploring text and image features to classify images in bioscience literature
BioNLP '06 Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis
Figure mining for biomedical research
Bioinformatics
ISMB/ECCB'09 Proceedings of the 2009 workshop of the BioLink Special Interest Group, international conference on Linking Literature, Information, and Knowledge for Biology
Images are a significant and useful source of information in published biomedical articles, yet they remain under-utilized in biomedical document classification and retrieval. Much current work on biomedical image retrieval and classification represents images with simple, standard features such as gray-scale histograms and edge direction. We used such features ourselves in earlier work [5], where image-class tags served to represent and classify articles. In the work presented here we focus on a different literature classification task, motivated by the need to identify articles discussing cis-regulatory elements and modules in the context of understanding complex gene networks. Curators who search the vast literature for such articles rely on a major cue: a certain type of image that shows the conserved cis-regulatory region on the DNA. Our experiments show that automatically identifying such images using common image features (like those mentioned above) can be highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features yields a novel representation of images and identifies DNA content in images with high precision and recall (over 0.9). Using the occurrence of such DNA-rich images within articles, we train a classifier that identifies articles pertaining to cis-regulatory elements with similarly high precision and recall. OCR-based image features have much potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA, and proteins. Moreover, the ability to identify such images automatically is likely to be widely applicable in other important biomedical document classification tasks.
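To make the character-distribution idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the OCR step has already produced a plain-text string for each image; the function names, the choice of A/C/G/T as the letters of interest, and the `threshold` and `min_letters` values are all hypothetical, and the fixed threshold rule merely stands in for the trained classifier described above.

```python
from collections import Counter
import string

# Assumption: DNA-rich images are dominated by these four letters.
DNA_LETTERS = "ACGT"


def char_distribution(ocr_text):
    """Relative frequency of each letter A-Z among the alphabetic
    characters of the OCR output (case-insensitive)."""
    letters = [c for c in ocr_text.upper() if c in string.ascii_uppercase]
    total = len(letters)
    if total == 0:
        # No letters recognized: return an all-zero distribution.
        return {c: 0.0 for c in string.ascii_uppercase}
    counts = Counter(letters)
    return {c: counts[c] / total for c in string.ascii_uppercase}


def dna_fraction(ocr_text):
    """Share of alphabetic characters that are A, C, G, or T."""
    dist = char_distribution(ocr_text)
    return sum(dist[c] for c in DNA_LETTERS)


def looks_dna_rich(ocr_text, threshold=0.8, min_letters=20):
    """Hypothetical rule: flag an image as DNA-rich when it contains
    enough letters and most of them are A/C/G/T."""
    n_letters = sum(1 for c in ocr_text.upper() if c in string.ascii_uppercase)
    return n_letters >= min_letters and dna_fraction(ocr_text) >= threshold


def count_dna_rich_images(ocr_texts):
    """Article-level feature: number of DNA-rich images in one article,
    given the OCR text of each of its images."""
    return sum(1 for text in ocr_texts if looks_dna_rich(text))
```

In practice the 26-value distribution from `char_distribution` (or summary statistics derived from it) would be passed to a learned classifier rather than compared against a hand-set threshold, and the per-article count from `count_dna_rich_images` would be one input feature for the article-level classifier.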