Combining multiple sources of evidence in web information extraction

Authors:
Martin Labský; Vojtěch Svátek
Affiliations:
Department of Information and Knowledge Engineering, University of Economics, Praha 3, Czech Republic; Department of Information and Knowledge Engineering, University of Economics, Praha 3, Czech Republic
Venue:
ISMIS'08 Proceedings of the 17th international conference on Foundations of intelligent systems
Year:
2008

Abstract

Extracting meaningful content from collections of web pages with unknown structure is a challenging task that can be accomplished successfully only by exploiting multiple heterogeneous resources. In the Ex information extraction tool, human designers use so-called extraction ontologies to specify the domain semantics, to provide extraction evidence manually, and to define extraction subtasks to be carried out by trainable classifiers. Elements of an extraction ontology can be endowed with probability estimates, which are used to select and rank the attribute and instance candidates to be extracted. At the same time, regularities in HTML formatting are exploited locally.
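
To illustrate how probability estimates attached to several pieces of evidence might be merged into one score for ranking candidates, here is a minimal Python sketch. It is not the scoring model of the Ex tool, which the abstract does not specify. It assumes that each piece of evidence supplies an estimate of P(attribute | evidence) and that the pieces are conditionally independent, so that they can be combined in log-odds space (a naive Bayes style combination). The names `combine_evidence` and `rank_candidates`, the prior, and all numbers are hypothetical.

```python
import math


def _logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probabilities must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def combine_evidence(prior: float, estimates: list[float]) -> float:
    """Combine per-evidence estimates P(attribute | evidence_i).

    Assumption (ours, not the paper's): the pieces of evidence are
    conditionally independent, so each one shifts the prior log-odds
    by logit(estimate) - logit(prior).
    """
    prior_log_odds = _logit(prior)
    log_odds = prior_log_odds + sum(_logit(p) - prior_log_odds for p in estimates)
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_candidates(
    candidates: dict[str, list[float]], prior: float
) -> list[tuple[str, float]]:
    """Score every candidate and return them sorted by descending score."""
    scored = [(text, combine_evidence(prior, est)) for text, est in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for a "price" attribute; each list holds the
# estimates contributed by the evidence that fired for that candidate
# (for example a value pattern and a nearby context word).
ranking = rank_candidates({"199 reviews": [0.3], "$199": [0.9, 0.7]}, prior=0.2)
```

In this sketch a candidate supported by two agreeing pieces of evidence ends up with a higher score than either piece alone would give it, while a candidate with no evidence keeps the prior. A real system would also need to handle correlated evidence, which the independence assumption ignores.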