Fixing weakly annotated web data using relational models

  • Authors:
  • Fatih Gelgi;Srinivas Vadrevu;Hasan Davulcu

  • Affiliations:
  • Department of Computer Science and Engineering, Arizona State University, Tempe, AZ;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ

  • Venue:
  • ICWE'07 Proceedings of the 7th international conference on Web engineering
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data - which is typically generated by a (semi) automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: they (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template driven sites, from 68% to 87% for non-template driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.