Wrapping PDF documents exploiting uncertain knowledge
CAiSE'06 Proceedings of the 18th international conference on Advanced Information Systems Engineering
The widespread use of the PDF format for exchanging print-oriented documents raises new challenges for information extraction research. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Because objects in a PDF document are accessible by their position, we exploit spatial constraints to drive the extraction of relevant information according to a set of group type definitions. Moreover, using fuzzy-logic-based conditions makes it possible to handle uncertainty in interpreting the layout structure of PDF documents effectively. The experimental results reported in the paper indicate that our PDF wrapping system achieves good accuracy.
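To make the idea of fuzzy spatial constraints concrete, the following is a minimal sketch and not the system described in the paper. It assumes each PDF object is reduced to an axis-aligned bounding box and scores, in the range 0 to 1, how well a candidate box satisfies "lies just to the right of a label, on the same row". The names `Box`, `ramp_down`, `right_of`, `same_row` and `value_of_label`, the thresholds, and the choice of the minimum as fuzzy conjunction are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a page object (hypothetical page units)."""
    x0: float
    y0: float
    x1: float
    y1: float


def ramp_down(value: float, full: float, zero: float) -> float:
    """Membership is 1.0 up to `full`, then falls linearly to 0.0 at `zero`."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def right_of(label: Box, candidate: Box, max_gap: float = 100.0) -> float:
    """Degree to which `candidate` starts to the right of `label` and close to it."""
    gap = candidate.x0 - label.x1
    if gap < 0:  # candidate overlaps the label or lies to its left
        return 0.0
    return ramp_down(gap, full=max_gap / 4, zero=max_gap)


def same_row(label: Box, candidate: Box, tolerance: float = 10.0) -> float:
    """Degree to which the vertical centers of the two boxes agree."""
    dy = abs((label.y0 + label.y1) / 2 - (candidate.y0 + candidate.y1) / 2)
    return ramp_down(dy, full=tolerance / 4, zero=tolerance)


def value_of_label(label: Box, candidate: Box) -> float:
    """Fuzzy AND of both spatial conditions (minimum t-norm, an assumed choice)."""
    return min(right_of(label, candidate), same_row(label, candidate))


label = Box(50, 700, 120, 712)
aligned = Box(130, 701, 200, 713)    # close by and on the same row
shifted = Box(130, 695, 200, 707)    # close by but slightly lower
elsewhere = Box(130, 650, 200, 662)  # far below the label

for name, box in [("aligned", aligned), ("shifted", shifted), ("elsewhere", elsewhere)]:
    print(name, round(value_of_label(label, box), 3))
```

A graded score of this kind lets an extraction rule rank slightly misaligned candidates instead of rejecting them outright, which is the sort of layout uncertainty the abstract refers to.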