The Portable Document Format (PDF) is widely used on the Web, and search engines index PDF documents, but only their text content. The goal of this work is to extract and annotate the images in PDF documents, so that the images become searchable and can be annotated semantically. The first step extracts the images, converts them into a standard format such as JPEG, and recognizes the corresponding image captions using the layout structure and geometric relationships. The second step applies linguistic-semantic analysis to the caption text in the context of the document domain. On a collection of PDF documents with about 3300 pages and 6500 images, the approach achieves a precision of 95.5% and a recall of 88.8% for correct image captions.
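
The abstract does not spell out which geometric rules are used to match captions to images, so the following is only a minimal sketch of one plausible rule, not the authors' method. It assumes that a caption is the nearest text block directly below an image and that it overlaps the image horizontally. The names `Box`, `horizontal_overlap` and `find_caption`, and the thresholds `max_gap` and `min_overlap`, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned box in page coordinates (origin top-left, y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def horizontal_overlap(a: Box, b: Box) -> float:
    """Return the overlap width as a fraction of the narrower box's width."""
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if narrower <= 0:
        return 0.0
    return max(0.0, overlap) / narrower


def find_caption(
    image: Box,
    blocks: List[Box],
    max_gap: float = 40.0,      # assumed threshold, in page units
    min_overlap: float = 0.5,   # assumed threshold, fraction of width
) -> Optional[Box]:
    """Pick the closest text block below the image that is aligned with it.

    Assumption: captions sit below their image. Real layouts also place
    captions above or beside images, which this sketch ignores.
    """
    candidates = []
    for block in blocks:
        gap = block.y0 - image.y1  # vertical distance from image bottom to block top
        if 0 <= gap <= max_gap and horizontal_overlap(image, block) >= min_overlap:
            candidates.append((gap, block))
    if not candidates:
        return None
    # The smallest vertical gap wins.
    return min(candidates, key=lambda pair: pair[0])[1]
```

A fuller version would presumably also consider captions above or beside an image and textual cues such as a leading "Figure" or "Fig." label, and would hand the selected text to the linguistic-semantic analysis step.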