Extracting and matching authors and affiliations in scholarly documents
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
A figure search engine architecture for a chemistry digital library
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
Hi-index | 0.00 |
Figures in biomedical articles often constitute direct evidence of experimental results. Image analysis methods can be coupled with text-based methods to improve knowledge discovery. However, automatically harvesting figures along with their associated captions from full-text articles remains challenging. In this paper, we present an automatic system for robustly harvesting figures from biomedical literature. Our approach relies on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and enforce constraints among figure-regions. This allows us to harvest fragments of figures (subfigures), from the PDF, correctly identify subfigures that belong to the same figure, and identify the captions associated with each figure. Our method simultaneously recovers figures and captions and applies additional filtering process to remove irrelevant figures such as logos, to eliminate text passages that were incorrectly identified as captions, and to re-group subfigures to generate a putative figure. Finally, we associate figures with captions. Our preliminary experiments suggest that our method achieves an accuracy of 95% in harvesting figures-caption pairs from a set of 2, 035 full-text biomedical documents from Bio Creative III, containing 12, 574 figures.