Classification of XSLT-Generated Web Documents with Support Vector Machines

  • Authors:
  • Atakan Kurt; Engin Tozal

  • Affiliations:
  • Computer Eng. Dept., Fatih University, Istanbul, Turkey; Computer Eng. Dept., Fatih University, Istanbul, Turkey

  • Venue:
  • KDXD'06 Proceedings of the First international conference on Knowledge Discovery from XML Documents
  • Year:
  • 2006

Abstract

XSLT is a transformation language mainly used for converting XML documents to HTML or other formats. Due to its simplicity and flexibility, XML has replaced traditional EDI file formats. Most e-business applications store data in XML, convert the XML into HTML using XSLT, and publish the HTML documents to the web. In this paper we argue that the use of XSLT presents an opportunity rather than a challenge for web document classification. We show that it is possible to combine the advantages of both HTML and XML classification by classifying documents at the XSLT transformation stage, an approach we call XSLT classification, and thereby to attain higher classification rates using Support Vector Machines (SVM). The results are both expected and promising. We believe that XSLT classification can become preferable to HTML or XML classification wherever XSLT stylesheets are available.