Extracting anchorable information units from PDF files

  • Authors:
  • A. Chakraborty;P. Liu;L. Hsu

  • Affiliations:
  • Siemens Corp. Res. Inc., Princeton, NJ, USA;Siemens Corp. Res. Inc., Princeton, NJ, USA;Siemens Corp. Res. Inc., Princeton, NJ, USA

  • Venue:
  • ICME '03 Proceedings of the 2003 International Conference on Multimedia and Expo - Volume 2
  • Year:
  • 2003

Abstract

Document processing and understanding are important for a variety of applications, such as office automation, creation of electronic manuals, online documentation, and annotation. The first step in this process often involves extracting relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document, producing an electronic document. This paper describes a novel method for extracting anchorable information units (AIUs), also known as hotspots, from any type of Portable Document Format (PDF) file, whether created using an editor or by scanning in documents. The AIUs are used to make these documents more intelligent for content cross-referencing to and from related multimedia documents within an electronic document publishing environment. Domain-specific knowledge about the documents is used to aid the extraction process. Once the location and extent of the text are found, the content is extracted using optical character recognition (OCR) software if necessary. For object extraction for highlighting, the images are extracted first, and then a variety of image processing algorithms are applied.
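
The abstract does not spell out how domain-specific knowledge drives the extraction, so the following is only a minimal illustrative sketch, not the authors' method. It assumes that text spans with page numbers and bounding boxes are already available (from a PDF text layer or from OCR) and that domain knowledge can be approximated by regular expressions. The names `TextSpan`, `AIU`, `DOMAIN_PATTERNS`, and `extract_aius`, as well as the two example patterns, are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """A run of text with its location (assumed input from a text layer or OCR)."""
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str


@dataclass(frozen=True)
class AIU:
    """A candidate anchorable information unit (hypothetical record layout)."""
    page: int
    bbox: tuple  # bounding box of the enclosing span, not of the match itself
    text: str
    kind: str


# Hypothetical stand-ins for domain-specific knowledge about the documents.
DOMAIN_PATTERNS = {
    "part_number": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "section_ref": re.compile(r"\bSection \d+(?:\.\d+)*"),
}


def extract_aius(spans, patterns=DOMAIN_PATTERNS):
    """Return one AIU per pattern match found in the given spans."""
    aius = []
    for span in spans:
        for kind, pattern in patterns.items():
            for match in pattern.finditer(span.text):
                aius.append(AIU(span.page, span.bbox, match.group(0), kind))
    return aius


spans = [
    TextSpan(1, (72, 700, 300, 712), "See Section 3.2 for valve GT-1042."),
    TextSpan(2, (72, 650, 200, 662), "No references here."),
]
aius = extract_aius(spans)
for aiu in aius:
    print(aiu.page, aiu.kind, aiu.text)
```

Each resulting record carries enough information (page, region, matched text, kind) for a later step to attach a hyperlink. A real system would likely narrow the bounding box to the matched phrase rather than reuse the whole span's box, as this sketch does.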