A framework for populating ontological models from semi-structured web documents

Authors:
Hassan A. Sleiman;Inma Hernández
Affiliations:
University of Sevilla, Spain;University of Sevilla, Spain
Venue:
ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Year:
2012

Citing 9
Cited 0

NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
DEByE - Date extraction by example

Data & Knowledge Engineering
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
SOFIE: a self-organizing framework for information extraction

Proceedings of the 18th international conference on World wide web
FiVaTech: Page-Level Web Data Extraction from Template Pages

IEEE Transactions on Knowledge and Data Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

The Web is the largest repository of information that has ever existed. This information is presented in a human friendly format using HTML, which complicates the consumption of this information by automatic processes. Solutions to this problem are the Semantic Web and Web Services, but the lack of such services in the majority of web sites has increased the interest on information extraction, which allow extracting and structuring information from web documents in ontological models. Despite the high number of proposals on information extraction, there does not exist a universally applicable information extractor. As a consequence, when populating an ontology model automatically from a web site, it is not unusual to need more than one information extractor. We propose a framework that allows the development, training, and the application of information extractors on semi-structured web documents to produce semantic data. We have developed a version of the framework and verified it by means of experiments on 15 web sites. Experimental results are very promising.