Classification of web documents using concept extraction from ontologies

  • Authors:
  • Marina Litvak;Mark Last;Slava Kisilevich

  • Affiliations:
  • Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel;Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel;Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel

  • Venue:
  • AIS-ADM'07 Proceedings of the 2nd international conference on Autonomous intelligent systems: agents and data mining
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present the ontology-based web content mining methodology that contains such main stages as creation of ontology for the specified domain, collecting a training set of labeled documents, building a classification model in this domain using the constructed ontology and a classification algorithm, and classification of new documents by information agents via the induced model. We evaluated the proposed methodology in two specific domains: the chemical domain (web pages containing information about production of certain chemicals), and Yahoo! collection of web news documents divided into several categories. Our system receives as input the domain-specific ontology, and a set of categorized web documents, and then perfroms concept generalization on these documents. We use a key-phrase extractor with integrated ontology parser for creating a database from input documents and use it as a training set for the classification algorithm. The system classification accuracy is estimated using various levels of ontology.