An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents

Authors:
Luis D. Lopez; Jingyi Yu; Cecilia N. Arighi; Hongzhan Huang; Hagit Shatkay; Cathy Wu
Venue:
BIBM '11 Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine
Year:
2011

Citing 0
Cited 3

Extracting and matching authors and affiliations in scholarly documents
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

A figure search engine architecture for a chemistry digital library
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

An Image-Text Approach for Extracting Experimental Evidence of Protein-Protein Interactions in the Biomedical Literature
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

Abstract

Figures in biomedical articles often constitute direct evidence of experimental results. Image analysis methods can be coupled with text-based methods to improve knowledge discovery. However, automatically harvesting figures along with their associated captions from full-text articles remains challenging. In this paper, we present an automatic system for robustly harvesting figures from biomedical literature. Our approach relies on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and to enforce constraints among figure regions. This allows us to harvest fragments of figures (subfigures) from the PDF, correctly identify subfigures that belong to the same figure, and identify the captions associated with each figure. Our method simultaneously recovers figures and captions and applies additional filtering steps to remove irrelevant figures such as logos, to eliminate text passages that were incorrectly identified as captions, and to regroup subfigures into putative figures. Finally, we associate figures with captions. Our preliminary experiments suggest that our method achieves an accuracy of 95% in harvesting figure-caption pairs from a set of 2,035 full-text biomedical documents from BioCreative III, containing 12,574 figures.
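
The abstract names three post-processing steps (regrouping subfigures, filtering out items such as logos, and associating figures with captions) without giving the rules behind them. The sketch below is only an illustration of how such steps could look once image and text regions are available as bounding boxes; it is not the authors' implementation. Every name and threshold in it (`Box`, `gap`, `group_subfigures`, `drop_small`, `match_captions`, `max_gap`, `min_area`, and the caption pattern) is an assumption made for this example, and nearest-region matching stands in for whatever layout constraints the paper actually uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region on a page (assumed: y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def gap(a: Box, b: Box) -> float:
    """Largest per-axis separation between two boxes (0 if they overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def group_subfigures(boxes, max_gap=10.0):
    """Merge boxes that lie within max_gap of each other (hypothetical rule)."""
    groups = []
    for box in boxes:
        near, far = [], []
        for g in groups:
            (near if any(gap(box, m) <= max_gap for m in g) else far).append(g)
        # The new box joins, and thereby merges, every group it touches.
        groups = far + [[box] + [m for g in near for m in g]]
    return groups


def bounding_box(group) -> Box:
    """Smallest box covering all members of a group."""
    return Box(min(b.x0 for b in group), min(b.y0 for b in group),
               max(b.x1 for b in group), max(b.y1 for b in group))


def drop_small(figures, min_area=1000.0):
    """Assumed logo filter: discard regions below an area threshold."""
    return [f for f in figures if f.area >= min_area]


# Assumed caption cue: text starting with "Fig." or "Figure" plus a number.
CAPTION_RE = re.compile(r"^(Fig\.|Figure)\s*\d+", re.IGNORECASE)


def match_captions(figures, text_blocks):
    """Greedily pair each figure with the nearest unused caption-like block."""
    captions = [(b, t) for b, t in text_blocks if CAPTION_RE.match(t.strip())]
    pairs = []
    for fig in figures:
        if not captions:
            break
        best = min(captions, key=lambda c: gap(fig, c[0]))
        captions.remove(best)
        pairs.append((fig, best[1]))
    return pairs
```

A typical call sequence would be `group_subfigures` on the raw image regions, `bounding_box` on each group, `drop_small` on the results, and finally `match_captions` against the page's text blocks. The greedy matching is a deliberate simplification: it can mispair figures on dense pages, which is one reason a real system would need stronger layout constraints.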