Fixing weakly annotated web data using relational models

Authors:
Fatih Gelgi;Srinivas Vadrevu;Hasan Davulcu
Affiliations:
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ
Venue:
ICWE'07 Proceedings of the 7th international conference on Web engineering
Year:
2007

Citing 8
Cited 0

Mining association rules between sets of items in large databases

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Learning Probabilistic Relational Models

IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
Iterative Bayes

Theoretical Computer Science
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Automatic information extraction from large websites

Journal of the ACM (JACM)
Introduction to Machine Learning (Adaptive Computation and Machine Learning)

Introduction to Machine Learning (Adaptive Computation and Machine Learning)
Managing Semi-Structured Data

Queue - Semi-structured Data
Semantic partitioning of web pages

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data - which is typically generated by a (semi) automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: they (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template driven sites, from 68% to 87% for non-template driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.