Representation of hypertext documents based on terms, links and text compressibility

  • Authors:
  • Julian Szymański; Włodzisław Duch

  • Affiliations:
  • Department of Computer Systems Architecture, Gdańsk University of Technology, Poland; Department of Informatics, Nicolaus Copernicus University, Toruń, Poland and School of Computer Engineering, Nanyang Technological University, Singapore

  • Venue:
  • ICONIP'10 Proceedings of the 17th international conference on Neural information processing: theory and algorithms - Volume Part I
  • Year:
  • 2010

Abstract

Three methods for representing hypertext, based on links, terms, and text compressibility, are compared for their usefulness in document classification. The documents were selected from Wikipedia articles in five distinct categories. For each representation, dimensionality was reduced with Principal Component Analysis (PCA), which also provides a rough visual presentation of the data. The compression-based feature space needed about five times fewer principal components than the term-based or link-based representations to reach 90% cumulative variance, while giving comparable classification results with Support Vector Machines.
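
The abstract does not say how the compression-based features are computed, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each document is described by its normalized compression distance (NCD) to a set of reference documents, with zlib as the compressor. The helper `components_for_variance` shows how a 90% cumulative-variance cutoff could be counted from PCA eigenvalues. All function names here are hypothetical.

```python
import zlib
from typing import List, Sequence


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Assumption: NCD stands in for whatever compressibility measure
    the paper actually uses.
    """
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def compression_features(doc: str, references: Sequence[str]) -> List[float]:
    """Represent a document by its NCD to each reference document."""
    return [ncd(doc, ref) for ref in references]


def components_for_variance(eigenvalues: Sequence[float],
                            threshold: float = 0.9) -> int:
    """Smallest number of principal components whose eigenvalues
    account for at least `threshold` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(eigenvalues)
```

With such a representation, the feature dimension equals the number of reference documents, which may be one reason a compression-based space can be far more compact than a term or link vocabulary; the abstract itself reports only the observed difference in component counts.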