Towards a System for Ontology-Based Information Extraction from PDF Documents

Authors:
Ermelinda Oro; Massimo Ruffolo
Affiliations:
Department of Computer Science and System Science (DEIS); Institute of High Performance Computing and Networking of CNR (ICAR-CNR), University of Calabria, Rende (CS), Italy 87036
Venue:
OTM '08 Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. Part II on On the Move to Meaningful Internet Systems
Year:
2008

Abstract

Ontologies make it possible to encode domain knowledge directly in software applications, so ontology-based systems can exploit the meaning of information to provide advanced and intelligent functionalities. One of the most interesting and promising applications of ontologies is information extraction from unstructured documents. In this area, the extraction of meaningful information from PDF documents has recently been recognized as an important and challenging problem. This paper proposes an ontology-based information extraction system for PDF documents founded on a well-suited knowledge representation approach named self-populating ontology (SPO). The SPO approach combines object-oriented, logic-based features with formal grammar capabilities and allows knowledge to be expressed in terms of ontology schemas, instances, and extraction rules (called descriptors) aimed at extracting information, including information in tabular form. The novel aspect of the SPO approach is that it allows ontologies to be enriched with rules that enable them to populate themselves with instances extracted from unstructured PDF documents. In the paper, the tractability of the SPO approach is proven. Moreover, the features and behavior of the prototypical implementation of the SPO system are illustrated by means of a running example.
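
The idea of an ontology that carries its own extraction rules can be illustrated with a minimal sketch. This is not the SPO language or the authors' implementation: it assumes plain regular expressions in place of the paper's logic-based descriptors and formal grammars, it works on already-extracted text rather than on PDF layout, and every name in it (Ontology, Descriptor, populate, the Invoice class and its attributes) is hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """Hypothetical extraction rule: a regex whose named groups fill class attributes."""
    target_class: str
    pattern: str


@dataclass
class Ontology:
    """Toy ontology: a schema, its instances, and the rules that populate it."""
    schema: dict                                   # class name -> list of attribute names
    descriptors: list = field(default_factory=list)
    instances: dict = field(default_factory=dict)  # class name -> list of attribute dicts

    def populate(self, text: str) -> int:
        """Apply every descriptor to the text; return the number of new instances."""
        added = 0
        for d in self.descriptors:
            attrs = self.schema[d.target_class]
            for m in re.finditer(d.pattern, text):
                inst = {a: m.group(a) for a in attrs}
                bucket = self.instances.setdefault(d.target_class, [])
                if inst not in bucket:  # skip instances that are already known
                    bucket.append(inst)
                    added += 1
        return added


# Assumed example data: two table-like rows of text taken from a document.
onto = Ontology(
    schema={"Invoice": ["number", "total"]},
    descriptors=[
        Descriptor(
            "Invoice",
            r"Invoice\s+(?P<number>\d+)\s+Total:\s+(?P<total>\d+\.\d{2})",
        )
    ],
)
text = "Invoice 1042  Total: 99.50\nInvoice 1043  Total: 12.00"
added = onto.populate(text)
```

In this sketch the schema, the instances, and the rules live in one object, so applying the rules to new text extends the same structure that defines what may be extracted. Running populate a second time on the same text adds nothing, because known instances are skipped.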