Categorisation of web documents using extraction ontologies

  • Authors:
  • Li Xu; David W. Embley

  • Affiliations:
  • Department of Computer Science, University of Arizona South, 1140 N Colombo Ave., Sierra Vista, AZ 85635, USA; Department of Computer Science, Brigham Young University, Provo, Utah, USA

  • Venue:
  • International Journal of Metadata, Semantics and Ontologies
  • Year:
  • 2008

Abstract

Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). We then use machine-learned rules over this extracted information to determine whether an HTML document contains items of interest. Experimental results show that our ontological approach to categorisation works well, having achieved F-measures above 90% for all applications we tried.
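
The two-stage idea in the abstract (extract ontological vocabulary and instance data, then apply a rule over the extracted evidence) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the car-advertisement domain, the regular expressions, the function names, and the hand-set thresholds, which merely stand in for the machine-learned rules the abstract mentions. It is not the authors' system.

```python
import re

# Hypothetical toy "extraction ontology" for car advertisements.
# The patterns below are illustrative assumptions, not taken from the paper.
KEYWORD_PATTERNS = [r"\bmileage\b", r"\bfor sale\b", r"\basking price\b"]
VALUE_PATTERNS = {
    "Price": r"\$\d{1,3}(?:,\d{3})*",
    "Year": r"\b(?:19|20)\d{2}\b",
    "Make": r"\b(?:Ford|Honda|Toyota)\b",
}


def strip_tags(html):
    """Crudely replace HTML tags with spaces so patterns see plain text."""
    return re.sub(r"<[^>]+>", " ", html)


def extract_features(html):
    """Count vocabulary hits and the number of concepts with a recognised value."""
    text = strip_tags(html)
    keyword_hits = sum(
        len(re.findall(p, text, re.IGNORECASE)) for p in KEYWORD_PATTERNS
    )
    concepts_found = sum(
        1 for p in VALUE_PATTERNS.values() if re.search(p, text)
    )
    return {"keyword_hits": keyword_hits, "concepts_found": concepts_found}


def is_relevant(features, min_keywords=1, min_concepts=2):
    """Hand-set threshold rule; a stand-in for a rule learned from labelled pages."""
    return (
        features["keyword_hits"] >= min_keywords
        and features["concepts_found"] >= min_concepts
    )


ad = (
    "<html><body><h1>2004 Honda Civic for sale</h1>"
    "<p>Asking price $5,500, low mileage.</p></body></html>"
)
news = "<p>The city council met in 2007 to discuss the budget.</p>"

print(is_relevant(extract_features(ad)))
print(is_relevant(extract_features(news)))
```

In a fuller version, the thresholds would be replaced by a classifier trained on feature vectors from labelled documents, which is closer in spirit to the learned rules described above.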