Wrapping PDF documents exploiting uncertain knowledge

  • Authors:
  • S. Flesca; S. Garruzzo; E. Masciari; A. Tagarelli

  • Affiliations:
  • DEIS, University of Calabria; DIMET, University of Reggio Calabria; ICAR-CNR – Institute of Italian National Research Council; DEIS, University of Calabria

  • Venue:
  • CAiSE'06: Proceedings of the 18th International Conference on Advanced Information Systems Engineering
  • Year:
  • 2006

Abstract

The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. Our proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
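
To make the idea of fuzzy constraints on token groupings more concrete, the following is a minimal illustrative sketch in Python. It is not the wrapper language or the evaluation algorithm defined in the paper. The `Token` class, the `closeness` membership function, the distance thresholds, and the greedy grouping strategy are all assumptions introduced here for illustration only. The sketch scores how well two tokens satisfy "left-aligned" and "vertically adjacent" conditions as degrees in [0, 1], combines them with `min` (a common choice of fuzzy conjunction), and groups tokens bottom-up when the combined degree reaches a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical text token with a position on the page (in points)."""
    text: str
    x: float  # left edge
    y: float  # vertical position, increasing down the page


def closeness(distance: float, full: float, zero: float) -> float:
    """Fuzzy membership: 1.0 up to `full`, falling linearly to 0.0 at `zero`."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def same_group_degree(a: Token, b: Token) -> float:
    """Degree to which two tokens belong together (thresholds are assumed)."""
    aligned = closeness(abs(a.x - b.x), full=2.0, zero=20.0)
    adjacent = closeness(abs(a.y - b.y), full=14.0, zero=40.0)
    # Fuzzy AND taken as the minimum of the two degrees.
    return min(aligned, adjacent)


def group_tokens(tokens, threshold=0.5):
    """Greedy bottom-up grouping: extend the current group while the
    constraint between its last token and the next one holds well enough."""
    groups = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if groups and same_group_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


if __name__ == "__main__":
    page = [
        Token("Name", 72.0, 100.0),
        Token("Alice", 72.0, 112.0),
        Token("Total", 300.0, 400.0),
    ]
    for group in group_tokens(page):
        print([t.text for t in group])
```

In this toy example the first two tokens share a left edge and sit on consecutive lines, so they end up in one group, while the distant third token starts a new one. Sorting followed by a single pass keeps the sketch polynomial in the number of tokens, but this says nothing about the complexity analysis given in the paper itself.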