A logic-based tool for semantic information extraction

Authors:
Massimo Ruffolo;Marco Manna;Lorenzo Gallucci;Nicola Leone;Domenico Saccà
Affiliations:
Exeura s.r.l.;Department of Mathematics, University of Calabria, Arcavacata di Rende (CS), Italy;Exeura s.r.l.;Exeura s.r.l.;Exeura s.r.l.
Venue:
JELIA'06 Proceedings of the 10th European conference on Logics in Artificial Intelligence
Year:
2006

Abstract

Recognizing and extracting meaningful information from unstructured Web documents, taking their semantics into account, is an important problem in information and knowledge management. This paper describes HıLεX, a system implementing a novel logic-based approach to information extraction from unstructured documents. The approach is founded on a new two-dimensional representation of documents and relies heavily on DLP+, an extension of disjunctive logic programming for ontology representation and reasoning that has recently been implemented on top of the DLV system. Unlike previous systems, which are mainly syntactic, HıLεX combines semantic and syntactic knowledge to make extraction more powerful. Ontologies, which represent the semantics of the domain of the information to be extracted, are encoded in DLP+, while extraction patterns are encoded as regular expressions in an ad hoc two-dimensional grammar. These regular expressions are translated internally into DLP+ rules, whose execution performs the actual extraction of information from the input document. HıLεX supports semantic information extraction from both HTML pages and flat text documents. Its usefulness has also been confirmed in practice: the system has been successfully employed in two advanced applications in the e-health and e-finance domains.
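
The combination of semantic and syntactic knowledge over a two-dimensional document representation can be illustrated with a minimal Python sketch. It is not the system's implementation and uses neither DLP+ nor DLV: the `Portion` class, the `ONTOLOGY` table, the function names, and the sample data are all assumptions made purely for illustration. A toy "ontology" maps concepts to recognizers, and one spatial pattern ("a company with an amount to its right in the same row") stands in for a two-dimensional extraction pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """A piece of document text with its position on a 2D grid (hypothetical model)."""
    text: str
    row: int
    col: int


# Hypothetical ontology: concept name -> regular expression recognizing its instances.
ONTOLOGY = {
    "company": re.compile(r"^(Acme|Globex)$"),
    "amount": re.compile(r"^\d+(\.\d+)?$"),
}


def annotate(portions):
    """Semantic step: pair each portion with every ontology concept it matches."""
    return [
        (concept, portion)
        for portion in portions
        for concept, recognizer in ONTOLOGY.items()
        if recognizer.match(portion.text)
    ]


def extract_pairs(portions, left, right):
    """Syntactic step: find a `left` concept with a `right` concept to its right, same row."""
    labelled = annotate(portions)
    results = []
    for concept_a, a in labelled:
        if concept_a != left:
            continue
        for concept_b, b in labelled:
            if concept_b == right and b.row == a.row and b.col > a.col:
                results.append((a.text, b.text))
    return results


# Invented sample document: a small table flattened into positioned portions.
DOC = [
    Portion("Company", 0, 0), Portion("Revenue", 0, 1),
    Portion("Acme", 1, 0), Portion("120.5", 1, 1),
    Portion("Globex", 2, 0), Portion("98", 2, 1),
]

if __name__ == "__main__":
    print(extract_pairs(DOC, "company", "amount"))
```

The header row is ignored because neither "Company" nor "Revenue" matches any concept recognizer, which is the point of the semantic step: the spatial pattern only fires on portions the ontology has already recognized.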