Wrapping PDF documents exploiting uncertain knowledge

Authors:
S. Flesca; S. Garruzzo; E. Masciari; A. Tagarelli
Affiliations:
DEIS, University of Calabria; DIMET, University of Reggio Calabria; ICAR-CNR – Institute of Italian National Research Council; DEIS, University of Calabria
Venue:
CAiSE'06: Proceedings of the 18th International Conference on Advanced Information Systems Engineering
Year:
2006

Abstract

The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges for information extraction. Our proposal is based on a novel bottom-up wrapping approach that extracts information tokens and combines them into groups that reflect the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
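
To make the idea of fuzzy constraints on token groupings more concrete, the following is a minimal illustrative sketch in Python. It is not the wrapper language or the evaluation algorithm from the paper: the `Token` class, the `closeness` membership function, the `same_line_degree` constraint, the numeric limits, and the 0.5 threshold are all assumptions chosen for illustration. The sketch only shows how a soft layout constraint can yield a degree in [0, 1] instead of a yes/no answer, and how tokens can be merged bottom-up when that degree is high enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with a page position (hypothetical, for illustration)."""
    text: str
    x: float  # left edge on the page
    y: float  # baseline position on the page


def closeness(distance: float, full: float, zero: float) -> float:
    """Fuzzy degree in [0, 1]: 1 up to `full`, falling linearly to 0 at `zero`."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def same_line_degree(a: Token, b: Token) -> float:
    """Fuzzy conjunction (minimum) of two soft constraints:
    nearly equal baselines and a small horizontal gap.
    The numeric limits are assumed values, not taken from the paper."""
    return min(
        closeness(abs(a.y - b.y), 1.0, 4.0),
        closeness(abs(b.x - a.x), 40.0, 120.0),
    )


def group_tokens(tokens, threshold=0.5):
    """Bottom-up pass: append each token to the current group when the
    constraint holds to at least `threshold`; otherwise start a new group."""
    groups = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        if groups and same_line_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


tokens = [
    Token("Invoice", 10, 100), Token("No.", 45, 100.5), Token("4711", 70, 101),
    Token("Total", 10, 130), Token("99.00", 200, 130),
]
groups = group_tokens(tokens)
for group in groups:
    print([t.text for t in group])
```

In this sketch the slightly jittered baselines of the first three tokens still satisfy the constraint fully, so they end up in one group, while the wide horizontal gap between "Total" and "99.00" drives the degree to zero and keeps them apart. A crisp rule such as "same y coordinate" would have split the first line.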