The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges for information extraction. Our proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because of the intrinsic uncertainty about the structure and presentation of PDF documents, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
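As a rough illustration of what a fuzzy condition on token groupings could look like, the following minimal sketch groups positioned tokens into lines. It is not the paper's formal semantics or its evaluation algorithm. The `Token` fields, the piecewise-linear `closeness` membership function, the distance thresholds, the use of the minimum as fuzzy conjunction, and the greedy single-pass grouping are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical text token with page coordinates in points."""
    text: str
    x: float  # left edge
    y: float  # baseline


def closeness(distance: float, full: float, zero: float) -> float:
    """Membership degree: 1.0 up to `full`, 0.0 from `zero`, linear in between."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def same_line_degree(a: Token, b: Token) -> float:
    """Degree to which two tokens belong to the same line.

    The two conditions are combined with min, a common fuzzy AND.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    aligned = closeness(abs(a.y - b.y), full=1.0, zero=5.0)
    adjacent = closeness(abs(a.x - b.x), full=60.0, zero=200.0)
    return min(aligned, adjacent)


def group_tokens(tokens, threshold=0.5):
    """Greedy bottom-up grouping in reading order (top to bottom, left to right).

    A token joins the current group when its same-line degree with the
    group's last token reaches `threshold`; otherwise it starts a new group.
    """
    groups = []
    for tok in sorted(tokens, key=lambda t: (-t.y, t.x)):
        if groups and same_line_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


if __name__ == "__main__":
    page = [
        Token("Total", 50.0, 650.0),
        Token("No.", 100.0, 700.0),
        Token("Invoice", 50.0, 700.0),
    ]
    for group in group_tokens(page):
        print(" ".join(t.text for t in group))
```

Because membership degrees are graded rather than Boolean, small deviations in baseline or spacing lower the degree instead of rejecting a grouping outright. The threshold then sets how much deviation the sketch tolerates.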