A wrapper generation system for PDF documents

Authors:
Bettina Fazzinga;Sergio Flesca;Andrea Tagarelli;Salvatore Garruzzo;Elio Masciari
Affiliations:
DEIS-UNICAL, Rende, Italy;DEIS-UNICAL, Rende, Italy;DEIS-UNICAL, Rende, Italy;DIMET-UNIRC, Reggio Calabria, Italy;ICAR-CNR, Rende, Italy
Venue:
Proceedings of the 2008 ACM symposium on Applied computing
Year:
2008

Abstract

The widespread use of the PDF format for exchanging print-oriented documents raises new challenges for information extraction research. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Objects in a PDF document are accessible by their position, so we exploit spatial constraints to drive the extraction of relevant information according to a set of group type definitions. Moreover, conditions based on fuzzy logic make it possible to handle uncertainty in the interpretation of the layout structure of PDF documents. The experimental results reported in the paper indicate that our PDF wrapping system achieves good accuracy.
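
To make the idea of fuzzy spatial conditions more concrete, the following is a minimal sketch and not the system described in the paper. It assumes each PDF object is reduced to an axis-aligned bounding box and scores how well a candidate value box sits to the right of a label box on the same row. The names `Box`, `ramp`, `right_of`, `same_row`, and `label_value_degree`, the `tolerance` parameter, and the choice of the minimum as fuzzy conjunction are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a PDF object (hypothetical layout model)."""
    x0: float
    y0: float
    x1: float
    y1: float


def ramp(value: float, low: float, high: float) -> float:
    """Piecewise-linear membership: 0 at or below low, 1 at or above high."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def right_of(a: Box, b: Box, tolerance: float = 10.0) -> float:
    """Degree to which b lies to the right of a.

    The degree is 1.0 when b starts at or beyond a's right edge and falls
    linearly to 0.0 as b overlaps a horizontally by `tolerance` units.
    """
    gap = b.x0 - a.x1
    return ramp(gap, -tolerance, 0.0)


def same_row(a: Box, b: Box) -> float:
    """Vertical overlap of a and b relative to the height of the shorter box."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shorter = min(a.y1 - a.y0, b.y1 - b.y0)
    if shorter <= 0:
        return 0.0
    return max(0.0, overlap) / shorter


def label_value_degree(label: Box, value: Box) -> float:
    """Fuzzy AND (minimum t-norm, an assumed choice) of both conditions."""
    return min(right_of(label, value), same_row(label, value))


# Made-up coordinates: the value box overlaps the label by 5 units.
label = Box(50, 700, 120, 712)
value = Box(115, 701, 200, 713)
score = label_value_degree(label, value)
```

A crisp constraint would reject the pair above outright because the boxes overlap slightly, whereas the fuzzy condition returns an intermediate degree that a wrapper could compare against a threshold.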