Incremental Ontology-Based Extraction and Alignment in Semi-structured Documents

Authors:
Mouhamadou Thiam;Nacéra Bennacer;Nathalie Pernelle;Moussa Lô
Affiliations:
LRI, Université Paris-Sud 11, INRIA Saclay Ile de France, Orsay Cedex, France F-91893 and LANI, Université Gaston Berger, UFR S.A.T, Saint-Louis, Sénégal;SUPELEC, Gif-sur-Yvette cedex, France F-91192;LRI, Université Paris-Sud 11, INRIA Saclay Ile de France, Orsay Cedex, France F-91893;LANI, Université Gaston Berger, UFR S.A.T, Saint-Louis, Sénégal
Venue:
DEXA '09 Proceedings of the 20th International Conference on Database and Expert Systems Applications
Year:
2009

Abstract

SHIRI is an ontology-based system for integrating semi-structured documents related to a specific domain. Its purpose is to let users access the relevant parts of documents as answers to their queries. SHIRI represents resources in RDF/OWL and queries them with SPARQL. It relies on an automatic, unsupervised, ontology-driven approach to extract, align, and semantically annotate the tagged elements of documents. In this paper, we focus on the Extract-Align algorithm, which uses a set of named-entity and term patterns to extract candidate terms and align them with the ontology. The algorithm proceeds incrementally: it populates the ontology with terms describing instances of the domain, which reduces the need to access external resources such as the Web. We evaluate it on an HTML corpus of calls for papers in computer science, and the results are promising. They show that the incremental behaviour of Extract-Align enriches the ontology and that the number of terms (or named entities) aligned directly with the ontology increases.
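
The incremental idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' Extract-Align algorithm, only a toy illustration under stated assumptions: the ontology is reduced to a mapping from concepts to known terms, a single capitalised-phrase regular expression stands in for the named-entity and term patterns, and `external_lookup` is a hypothetical callable standing in for access to an external resource such as the Web. All class and function names are invented for this example.

```python
import re


def normalize(term):
    """Lowercase a term and collapse whitespace so comparisons are exact."""
    return " ".join(term.lower().split())


class ToyOntology:
    """Hypothetical stand-in for the ontology: concept -> set of known terms."""

    def __init__(self, labels):
        self.terms = {c: {normalize(t) for t in ts} for c, ts in labels.items()}

    def align(self, term):
        """Return the concept whose term set contains the term, else None."""
        key = normalize(term)
        for concept, terms in self.terms.items():
            if key in terms:
                return concept
        return None

    def add_term(self, concept, term):
        """Enrich the ontology with a newly aligned term."""
        self.terms.setdefault(concept, set()).add(normalize(term))


# Assumed stand-in for the term patterns: runs of capitalised words.
CANDIDATE_PATTERN = re.compile(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*")


def extract_align(elements, ontology, external_lookup):
    """Align candidate terms directly when possible, externally otherwise."""
    stats = {"direct": 0, "external": 0, "unaligned": 0}
    annotations = []
    for text in elements:
        for term in CANDIDATE_PATTERN.findall(text):
            concept = ontology.align(term)
            if concept is not None:
                stats["direct"] += 1
            else:
                concept = external_lookup(term)  # costly external access
                if concept is None:
                    stats["unaligned"] += 1
                    continue
                stats["external"] += 1
                # Incremental step: remember the term for later documents.
                ontology.add_term(concept, term)
            annotations.append((term, concept))
    return annotations, stats


# Usage: the second occurrence of "Semantic Web" is aligned directly.
calls = []


def lookup(term):
    calls.append(term)
    return {"semantic web": "Topic"}.get(normalize(term))


ontology = ToyOntology({"Topic": ["Data Mining"]})
elements = ["papers on Semantic Web", "Semantic Web and Data Mining"]
annotations, stats = extract_align(elements, ontology, lookup)
```

In this toy run the external lookup is consulted only once, for the first occurrence of "Semantic Web"; every later candidate is aligned against the enriched ontology, which mirrors the reduction in external access that the abstract reports.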